Method for lossless encoding of image data by approximating linear transforms and preserving selected properties for image processing

ABSTRACT

A method for generating a first plurality of output data values and the matrix factors used to generate an approximation to an image processing transform is disclosed. The first plurality of output data values are generated by transforming a plurality of input data values using a computer and applying a modified transform stored in a modified transformation matrix to the plurality of input data values. The plurality of input data values are stored in a generated matrix, and at least one data value in this matrix is rearranged using a permutation operation and modified by applying a linear combination of the unmodified values to the at least one data value. The modified transform is an approximation to a known transform stored in a transformation matrix that is used to generate a second plurality of output data values, the first plurality of output values approximating the second plurality of output data values. The modified transformation matrix is generated from a plurality of matrix factors that are generated by factoring the transformation matrix. The known transform and the modified transform approximating the known transform map the same integer data in the plurality of input data values to the same plurality of integer output data values.

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application claims the benefit of provisional U.S. Application Serial No. 60/250,829, filed Dec. 1, 2000; and provisional U.S. Application Serial No. 60/250,850, filed Dec. 1, 2000, under 35 U.S.C. §119.

FIELD OF THE INVENTION

[0002] This invention relates to image processing and, more particularly, to a method for approximating image processing transforms with invertible integer-to-integer transforms.

BACKGROUND OF THE INVENTION

[0003] Contemporary image processing methods apply image data transforms in a vast array of applications. In many cases, the transformation process causes a loss of data and frequently requires complex image data compression and decompression circuitry for image transmission and restoration. Lossless image transformation processes are generally perceived as being computationally expensive and tend not to be used as often as lossy transformation processes. Among the most widely used image data transforms are the wavelet transform and the color transform.

[0004] The processing of image data also requires sophisticated methods for manipulating data stored in multi-dimensional matrices. At times, such data can be highly position dependent, and any processing on such data may require certain types of highly localized processing given the unique nature of the interactions between image data at neighboring locations. These matrices may also be comprised of different types of numerical values. Frequently, such matrices include real number entries that are applied to an input data set that may also be comprised of real numbers or other types of numerical data. In the event integer input data is provided, the output image data produced by a transformation matrix comprised of real numbers may not necessarily provide the most accurate mapping of input data values to output data values. Furthermore, the reliable mapping of entire classes of numerical data may require the processing of bounded and unbounded length input data vectors.

[0005] A reliable image data transformation process is needed that can be used to map integer input data to integer output data without the possible loss of data associated with the transformation of such input data by transformation matrices including real number entries. This process must be capable of being applied to the most widely used transforms for input data vectors that are bounded and unbounded in length. The present invention is directed to providing computer-implemented methods for generating integer-to-integer transformation matrices that are approximations to known linear transformation matrices.

SUMMARY OF THE INVENTION

[0006] In accordance with the present invention, a method for lossless encoding of image data involving the approximation of an invertible linear transformation by an integer-to-integer transformation is disclosed. The invertible linear transformation may be applied to vectors of a fixed length or to vectors of unbounded length which are of bounded amplitude. An approximation to an invertible linear transformation is at times necessary when the input vectors are integers and the entries in a matrix including the transformation are real numbers. The invention discloses methods for factoring a matrix including an invertible linear transformation into simpler matrices of a form which allows them to be approximated directly.

[0007] Among the types of matrices that may be generated by the approximation methods are elementary matrices, permutation matrices and diagonal matrices, each of which is simpler to process in an image processing system. It is an object of the present invention to enable the factored matrices to have certain useful properties. Among the various useful properties preserved by the computer-implemented matrix factorization methods is their ability to maintain correct values at certain inputs that a given linear transformation already maps to integers. The methods can also be used to preserve local interactions between coordinates in an input subspace that directly influence neighboring data in an output subspace. The methods disclosed in this invention can be applied to well-known transforms such as the RGB→YC_(B)C_(R) color transform and the 9-7 wavelet transform.

[0008] In accordance with an aspect of the present invention, a method is described for generating a first plurality of output data values by transforming a plurality of input data values using a computer. The first plurality of output data values approximates a second plurality of output data values. The second plurality of output data values is generated by applying a linear transform to the plurality of input data values. The linear transform may be a finite one-dimensional linear transform or a 9-7 wavelet transform. This method comprises at least one step, the step being a rearrangement of at least one data value in a plurality of current input data values, a negation of at least one value in the plurality of current input data values, a modification of at least one data value in the plurality of current input data values, or a successive combination of one or more of these steps.

[0009] In accordance with another aspect of the invention, a method for factoring a linear transformation matrix storing a finite one-dimensional linear transform is disclosed. The method requires the LU-decomposition of the linear transformation matrix, the generation of four matrices from the LU-decomposed matrix, the generation of a third matrix and a signed permutation matrix, the generation of a permuted linear transformation matrix, the computation of an upper triangular matrix from the permuted linear transformation matrix, the factoring of the permuted linear transformation matrix, and the generation of matrix factors from the linear transformation matrix.

[0010] In yet another aspect of the present invention, a computer-implemented method is disclosed for generating a sequence of matrix factors for use in generating a matrix that approximates the transformation matrix that stores a wavelet transform. This method involves the application of at least one row reduction operation to the transformation matrix and the generation of a sequence of matrix factors from the transformation matrix after application of the at least one row reduction operation.

[0011] As will be readily appreciated from the foregoing description, the invention provides several new methods for the lossless encoding of image data by use of an approximation matrix that can map integer input data to integer output data that is the same as that mapped to by a transformation matrix storing a known linear transform.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:

[0013] FIG. 1 is a flowchart illustrating a process for generating matrix factors in accordance with the present invention;

[0014] FIG. 2 is a flowchart illustrating a process for transforming a plurality of input data values in accordance with the present invention;

[0015] FIG. 3 is a flowchart illustrating a process for generating matrix factors for a finite-dimensional linear transform in accordance with the present invention;

[0016] FIG. 4 is a flowchart illustrating a process for generating matrix factors for a wavelet transform stored in a transformation matrix in accordance with the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0017] While the preferred embodiment of the invention has been illustrated and described, it will be appreciated that various changes can be made therein without departing from the spirit and scope of the invention.

Introduction

[0018] In various signal processing applications, one must apply an invertible linear transformation. The input vectors are often given as integers, while the transformation has real entries. In some situations, it is convenient to approximate the linear transformation by another (possibly nonlinear) invertible transformation which maps integer vectors to integer vectors.

[0019] There are actually two versions of this problem. The first version arises when the given vectors have a fixed (probably small) length n. Given a linear map from R^n onto R^n specified by an n×n matrix A, a (probably nonlinear) bijection φ is needed from Z^n to Z^n which is close to A in the sense that ∥Ax−φx∥ is bounded for x∈Z^n. (We will use the Euclidean norm, but one could use other norms as well. Note also the use of the words “length” for the number of coordinates in a vector, and “norm” for the Euclidean magnitude of the vector.) Equivalently, we have the standard integer lattice Z^n and a linearly transformed lattice AZ^n, and we want to find a bijection ψ from the transformed lattice to the original lattice which moves points as small a distance as possible (so ψ=φ·A⁻¹). Please note that we are using the same symbol A for the linear transformation and for its associated matrix.

[0020] In the second version of the problem, the input vectors are signals which can have unbounded length, but are of bounded amplitude (i.e., the values appearing as coordinates of the vector are bounded). Both versions of the problem were described in co-pending and co-filed prior provisional U.S. Application Serial Nos. 60/250,829 and 60/250,850, each of which is hereby incorporated by reference. Several methods providing practical solutions to these problems will be discussed in this document. The image processing methods described herein can be performed in various software systems.

[0021] For both versions of the problem, the goal is to find an integer bijection φ approximating the given transformation so that the approximation error is bounded over all inputs, preferably with a small bound. (Of course, there are other properties one would like φ to have, such as easy computability of φ and φ⁻¹.) As we will see, this is possible only if the determinant is ±1 in the fixed-length case; there is a similar restriction in the unbounded-length case. Even then it is not obvious that one can get the error to be bounded. In the fixed-length case, one could try some sort of greedy algorithm which initially maps each point in the first lattice to the nearest point in the second lattice and then goes through a correction process to resolve collisions (two points mapped to the same point) and gaps (points in the second lattice not images of any point in the first lattice), but the corrections might get worse and worse as more points are processed, and it is not at all clear that one can get a bounded-error bijection this way.

Invertible Integer-to-Integer Mappings

[0022] We start by showing that, in the fixed-length case, a necessary condition for the existence of a bounded-error integer approximation φ to the linear transformation A is that det A=±1. Suppose that such a φ exists with error bound δ, and let ψ=φ·A⁻¹. Then for a large positive integer M, the points in the transformed lattice AZ^n within the cube [−M,M]^n are mapped by ψ to standard lattice points in the slightly larger cube [−M−δ,M+δ]^n, and all standard lattice points in the slightly smaller cube [−M+δ,M−δ]^n are reached in this way. So the number of transformed lattice points in the cube [−M,M]^n must be (2M+1)^n+O(M^(n−1)) for large M; this implies that the determinant of the transformed lattice (i.e., det A) is ±1.

[0023] We may as well assume that the determinant is 1 because, if it is −1, we can negate a row of the matrix to change the determinant to +1. An integer approximation for the modified matrix easily yields an integer approximation for the original matrix (just negate the specified coordinate at the end).

[0024] The main approach we will use for integer approximations is divide and conquer: if we have a linear transformation with no obvious suitable integer approximation, then we factor the matrix into parts which we do know how to approximate. The composition of these approximations of parts will be a suitable approximation to the entire transformation. To see this, first consider the two-factor case: if A=A₁A₂ and we have φ₁ and φ₂ approximating A₁ and A₂ so that ∥A₁x−φ₁x∥≦C₁ and ∥A₂x−φ₂x∥≦C₂ for all x, then φ₁·φ₂ approximates A, because
$\|A_1A_2x - \phi_1\phi_2x\| \leq \|A_1A_2x - A_1\phi_2x\| + \|A_1\phi_2x - \phi_1\phi_2x\| \leq \|A_1\|\,\|A_2x - \phi_2x\| + C_1 \leq \|A_1\|C_2 + C_1.$

[0025] We can iterate this: if A=A₁A₂ ⋯ A_(k) where each A_(i) can be approximated by an integer mapping with error bound C_(i), then A can be approximated by the composition of these integer mappings with error bound

$C_1 + \|A_1\|C_2 + \|A_1\|\|A_2\|C_3 + \cdots + \|A_1\|\|A_2\|\cdots\|A_{k-1}\|C_k. \quad (1.1)$

[0026] If one does the whole computation here at once, rather than iteratively, one gets a slightly improved form of the bound:

$C_1 + \|A_1\|C_2 + \|A_1A_2\|C_3 + \cdots + \|A_1A_2\cdots A_{k-1}\|C_k. \quad (1.2)$
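
These bounds are easy to evaluate mechanically. The following sketch is an editorial illustration, not part of the original disclosure; it assumes Python with NumPy and uses the spectral matrix norm, which matches the Euclidean vector norm used above.

```python
import numpy as np

def composed_error_bounds(factors, factor_bounds):
    """Evaluate bounds (1.1) and (1.2) for A = A_1 A_2 ... A_k,
    given the per-factor error bounds C_i (spectral norms used)."""
    bound_11 = 0.0          # C_1 + ||A_1||C_2 + ||A_1||||A_2||C_3 + ...
    bound_12 = 0.0          # C_1 + ||A_1||C_2 + ||A_1 A_2||C_3 + ...
    norm_product = 1.0
    prefix = np.eye(factors[0].shape[0])
    for A_i, C_i in zip(factors, factor_bounds):
        bound_11 += norm_product * C_i
        bound_12 += np.linalg.norm(prefix, 2) * C_i
        norm_product *= np.linalg.norm(A_i, 2)
        prefix = prefix @ A_i
    return bound_11, bound_12
```

Since ∥A₁A₂⋯A_(i−1)∥≦∥A₁∥∥A₂∥⋯∥A_(i−1)∥, the value of (1.2) never exceeds that of (1.1).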

[0027] Since the goal here is to produce invertible integer approximations to invertible linear transformations, we will also be interested in error estimates for the inverse transform: we will want a bound on ∥A⁻¹y−φ⁻¹y∥ over all integer vectors y. This bound will not in general be the same as the bound for the forward transform, but it is closely related: for any such y, if we let x=φ⁻¹y, then
$\|A^{-1}y - \phi^{-1}y\| = \|A^{-1}\phi x - x\| = \|A^{-1}\phi x - A^{-1}Ax\| = \|A^{-1}(\phi x - Ax)\| \leq \|A^{-1}\|\,\|Ax - \phi x\|. \quad (1.3)$

[0028] A similar computation gives ∥Ax−φx∥≦∥A∥ ∥A⁻¹y−φ⁻¹y∥, so ∥A⁻¹y−φ⁻¹y∥≧∥A∥⁻¹∥Ax−φx∥.

[0029] Formulas such as (1.2) indicate that, if multiple factorizations of a given transformation are available, then the ones with fewer factors are likely to have better error bounds (assuming that the error bounds and norms for the factors in the factorizations are similar).

[0030] A special factor that will occur frequently in factorizations and is easy to handle is a permutation matrix, which merely rearranges coordinates or bands. In fact, we can generalize this by allowing some of the 1's in the permutation matrix to be changed to −1's (negating some coordinates or bands). This will be needed because permutation matrices can have determinant −1 and we usually want to restrict to matrices of determinant 1. We will often refer to such signed permutation matrices as ‘permutation’ matrices.

[0031] Such a matrix is normally a “free” factor: it is approximable by an integer mapping with error 0, and its norm is 1, so it does not inflate the errors from the other factors in formula (1.1). In fact, since this matrix gives an isometry both of R^n and of Z^n, it cannot have any effect on the error bounds from a factorization if it occurs first or last in that factorization. If it occurs in the middle, then it might have a slight effect on the error by affecting the relation between the two factors it lies between.

[0032] Another type of factor which will be fundamental in the following is an elementary matrix, which differs from the identity matrix only at a single off-diagonal entry. Applying such a matrix has the effect of adding a multiple of one coordinate to another coordinate.

[0033] If A is a 2×2 elementary matrix, say $A = \begin{pmatrix}1 & a_{12}\\ 0 & 1\end{pmatrix}$, so that $A\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}x_1 + a_{12}x_2\\ x_2\end{pmatrix}$, then let $\phi\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}\langle x_1 + a_{12}x_2\rangle\\ x_2\end{pmatrix} = \begin{pmatrix}x_1 + \langle a_{12}x_2\rangle\\ x_2\end{pmatrix},$

[0034] where ⟨y⟩ is y rounded to an integer in a consistent way (say, round to nearest with half-integers rounded upward) so that, for any integer n and real number y, ⟨n+y⟩=n+⟨y⟩. (Consistency is needed only if we want to think of the mapping φ as “apply A and then round all coordinates to integers.” If we are willing to forget about ⟨x₁+a₁₂x₂⟩ and go straight to x₁+⟨a₁₂x₂⟩, then any rounding function can be used.) Then we have ∥Ax−φx∥≦½. And φ is invertible: if $\phi\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}y_1\\ y_2\end{pmatrix},$

[0035] then x₂=y₂ and x₁=y₁−⟨a₁₂x₂⟩. (Note that y₁−⟨a₁₂x₂⟩ might occasionally be different from y₁+⟨−a₁₂x₂⟩. In other words, the inverse of the integer approximation of A need not be the same as the integer approximation of the inverse of A, because the rounding is done slightly differently. However, for elementary matrices the differences should be rare, occurring only when the number to be rounded is equal to or very near a half-integer.)
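
As an illustration, this elementary-matrix step and its exact inverse can be sketched in a few lines of Python. This is an editorial sketch, not part of the original disclosure; the helper names are hypothetical, and round_half_up implements the consistent rounding ⟨·⟩ described above.

```python
import math

def round_half_up(y):
    # <y>: round to nearest, with half-integers rounded upward, so <n+y> = n + <y>
    return math.floor(y + 0.5)

def phi(x, a12):
    """Integer approximation of A = [[1, a12], [0, 1]]:
    (x1, x2) -> (x1 + <a12*x2>, x2)."""
    x1, x2 = x
    return (x1 + round_half_up(a12 * x2), x2)

def phi_inv(y, a12):
    """Exact inverse: x2 = y2, x1 = y1 - <a12*x2>."""
    y1, y2 = y
    return (y1 - round_half_up(a12 * y2), y2)
```

Here phi_inv(phi(x, a12), a12) returns x exactly for any integer pair x, since the same quantity ⟨a₁₂x₂⟩ is added and then subtracted; the forward error satisfies ∥Ax−φx∥≦½.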

[0036] We will see in the section entitled “Larger Matrices” that unit triangular matrices (i.e., lower or upper triangular matrices whose diagonal entries are all 1) are as suitable as elementary matrices for the purpose of obtaining integer approximations.

[0037] One can think of the usual full Gaussian elimination process as factoring a matrix into elementary matrices, simple permutation matrices (for pivoting), and a diagonal matrix. For most applications of such factorizations, the elementary matrices are the ones requiring attention, while the permutations and diagonal matrices are trivial and can be ignored. The present situation is an exception; elementary matrices (and permutation matrices) are easy to handle directly, but diagonal matrices are not. We will investigate the 2×2 diagonal matrices extensively in the next section.

[0038] The linear transformations and the corresponding integer approximations may change the range that the coordinates vary over: an approximation which maps integers to integers need not map 16-bit integers to 16-bit integers. It is easy to determine the new ranges after the linear transformation: if the transformation is given by A=(a_(ij))_(n,n) and input coordinate number j is bounded in absolute value by b_(j) for each j, then output coordinate number i is bounded in absolute value by $\sum_{j=1}^{n}|a_{ij}|\,b_j$.

[0039] (Similar bounds can be computed if the input coordinates are restricted to intervals not symmetric around 0.)

[0040] Since the integer approximation is supposed to be within a fixed distance of the linear transformation, one can easily adjust these bounds to get bounds for the approximation. (However, intermediate results may not lie within these bounds; one may need to compute ranges for the individual factor matrices in order to bound these.)
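
In code, the range computation of [0038] is a one-liner. This is an editorial sketch (NumPy assumed), not part of the original disclosure:

```python
import numpy as np

def output_bounds(A, b):
    """If |x_j| <= b_j for each input coordinate, then coordinate i of Ax
    is bounded in absolute value by sum_j |a_ij| * b_j."""
    return np.abs(A) @ np.asarray(b, dtype=float)
```

Applying the same call to each factor matrix in turn bounds the intermediate results mentioned above.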

The 2×2 Diagonal Matrix

[0041] As mentioned in the previous section, diagonal matrices, which are trivial for most applications, are quite nontrivial when it comes to integer approximations; here we will factor them into matrices which can be approximated directly. We may assume that the given diagonal matrix has determinant 1. Furthermore, if we have an n×n diagonal matrix of determinant 1, we can factor it into simpler diagonal matrices of determinant 1 each having only two nontrivial diagonal entries: the n=3 case is
$\begin{pmatrix}d_1 & 0 & 0\\ 0 & d_2 & 0\\ 0 & 0 & d_3\end{pmatrix} = \begin{pmatrix}d_1 & 0 & 0\\ 0 & d_1^{-1} & 0\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 & 0\\ 0 & d_1d_2 & 0\\ 0 & 0 & d_3\end{pmatrix}$

[0042] and larger matrices are handled similarly. So we can concentrate on the determinant-1 2×2 diagonal matrix
$D = \begin{pmatrix}\alpha & 0\\ 0 & \alpha^{-1}\end{pmatrix}.$

[0043] We may assume α>0, since otherwise we can just pull out the scaling factor
$\begin{pmatrix}-1 & 0\\ 0 & -1\end{pmatrix},$

[0044] which is “free” (i.e., just negate everything at the end).

[0045] We can factor D into four elementary matrices:
$D = \begin{pmatrix}1 & r\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ s & 1\end{pmatrix}\begin{pmatrix}1 & -r\alpha^{-1}\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ -s\alpha & 1\end{pmatrix} \quad (2.1)$

[0046] where r and s are chosen so that rs+1=α, or
$D = \begin{pmatrix}1 & 0\\ s & 1\end{pmatrix}\begin{pmatrix}1 & r\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ -s\alpha & 1\end{pmatrix}\begin{pmatrix}1 & -r\alpha^{-1}\\ 0 & 1\end{pmatrix} \quad (2.2)$

[0047] where rs+1=α⁻¹. Any such factorization leads to a bounded-error integer approximation for D. A plausible choice for r and s would be to “balance” the factors by requiring |r|=|sα|, but it is not clear that this will minimize the error bound. The factorizations appearing in the prior art are (2.1) with r=α−α² and (2.2) with s=−1.

[0048] Or we can factor D into three elementary matrices and a ‘permutation’ matrix:
$D = \begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}\begin{pmatrix}1 & 0\\ \alpha & 1\end{pmatrix}\begin{pmatrix}1 & -\alpha^{-1}\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ \alpha & 1\end{pmatrix}, \quad (2.3)$
$D = \begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}\begin{pmatrix}1 & \alpha^{-1}\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ -\alpha & 1\end{pmatrix}\begin{pmatrix}1 & \alpha^{-1}\\ 0 & 1\end{pmatrix}. \quad (2.4)$
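
These identities are easy to verify numerically. The following sketch is an editorial illustration (Python with NumPy assumed; the choice of r anticipates the “balanced” value discussed below):

```python
import numpy as np

def E_upper(t):   # elementary matrix with upper-right entry t
    return np.array([[1.0, t], [0.0, 1.0]])

def E_lower(t):   # elementary matrix with lower-left entry t
    return np.array([[1.0, 0.0], [t, 1.0]])

alpha = np.sqrt(2.0)
D = np.diag([alpha, 1.0 / alpha])

# (2.1) with r*s + 1 = alpha (the 'balanced' choice |r| = |s*alpha|):
r = np.sqrt(2.0 - np.sqrt(2.0))
s = (alpha - 1.0) / r
F21 = E_upper(r) @ E_lower(s) @ E_upper(-r / alpha) @ E_lower(-s * alpha)

# (2.3), using a 'permutation' factor of determinant 1:
P = np.array([[0.0, 1.0], [-1.0, 0.0]])
F23 = P @ E_lower(alpha) @ E_upper(-1.0 / alpha) @ E_lower(alpha)

assert np.allclose(F21, D) and np.allclose(F23, D)
```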

[0049] One can modify these so as to have the ‘permutation’ matrix appear at a different place in the factorization; this will not affect the error bounds we obtain here.

[0050] The factorizations (2.3) and (2.4) are closely related: one can get (2.4) from (2.3) by interchanging the two coordinates throughout and replacing α with α⁻¹. Note that “interchanging the two coordinates” is equivalent to conjugation by the reverse-diagonal matrix
$J = \begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}.$

[0051] Similarly, one gets (2.2) from (2.1) by interchanging the two coordinates, replacing α with α⁻¹, and interchanging r and s.

[0052] Note that these error bounds apply to D in isolation. If the diagonal matrix is just one part of a larger transformation, then when the parts are put together one can often combine factors such as adjacent elementary matrices with the nonzero entry in the same location; this normally will reduce the resulting error.

[0053] The factorizations (2.1)-(2.4) of D can easily be inverted to give factorizations of D⁻¹ into elementary factors. As noted earlier, the integer approximations for these inverse factorizations are not quite the same as the inverses of the integer approximations for the direct factorizations. However, in the 2×2 case the differences only show up when a result lies exactly halfway between two integers and must be rounded to one of them (assuming rounding to the nearest integer). Since the analysis here will not depend on how such choices are made, we can do error analysis of the inverse factorizations to get error bounds for the inverse transformation.

[0054] It turns out that we do not need to do any additional work to get the bounds for the inverse transformation here. The inverse of (2.3) is:
$D^{-1} = \begin{pmatrix}1 & 0\\ \alpha & 1\end{pmatrix}^{-1}\begin{pmatrix}1 & -\alpha^{-1}\\ 0 & 1\end{pmatrix}^{-1}\begin{pmatrix}1 & 0\\ \alpha & 1\end{pmatrix}^{-1}\begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}^{-1} = \begin{pmatrix}1 & 0\\ -\alpha & 1\end{pmatrix}\begin{pmatrix}1 & \alpha^{-1}\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ -\alpha & 1\end{pmatrix}\begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}.$

[0055] Conjugating this by the matrix
$\begin{pmatrix}1 & 0\\ 0 & -1\end{pmatrix}$

[0056] (an isometry which will not affect error bounds and which commutes with the given diagonal matrix) gives the formula
$D^{-1} = \begin{pmatrix}1 & 0\\ \alpha & 1\end{pmatrix}\begin{pmatrix}1 & -\alpha^{-1}\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ \alpha & 1\end{pmatrix}\begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}.$

[0057] This is exactly the same as (2.3) except that the ‘permutation’ occurs at the left (the end) instead of the right (the beginning). It follows that any error bound we obtain for the forward transform for (2.3) will also be a bound for the error in the inverse transform.

[0058] The same reasoning shows that any forward transform error bound for (2.4) is also an inverse transform error bound. The inverse of (2.1) turns out to be (2.2) with α, r, s replaced with α⁻¹, sα, rα⁻¹ respectively, so we can obtain inverse transform error bounds for (2.1) from forward transform error bounds for (2.2); similarly, forward transform error bounds for (2.1) give inverse transform error bounds for (2.2).

[0059] One can work out that
$\left\|\begin{pmatrix}1 & r\\ 0 & 1\end{pmatrix}\right\| = \left\|\begin{pmatrix}1 & 0\\ r & 1\end{pmatrix}\right\| = \frac{|r| + \sqrt{r^2 + 4}}{2}.$

[0060] Using this and the combination formula (1.1), one can compute an error bound for the integer mappings coming from each of the factorizations (2.1)-(2.4). (Each of the 2×2 elementary matrices has an error bound of ½ unless it has integer entries, in which case it has error bound 0.) However, the formulas for these bounds are rather messy; for instance, the error bound for (2.3) is
$\frac{1}{8}\left(4 + \left(2 + \alpha^{-1} + \sqrt{4 + \alpha^{-2}}\right)\left(\alpha + \sqrt{4 + \alpha^{-2}}\right)\right).$

[0061] As a useful example, let us consider the special case α=√2. In this case the error bounds for (2.3) and (2.4) become respectively
$\frac{1}{4}\left(2 + (\sqrt{3}+1)(\sqrt{2}+2)\right) \approx 2.8319512300735069, \qquad \frac{1}{2}\left(2 + \sqrt{2} + \sqrt{3}\right) \approx 2.5731321849709862.$

[0062] For (2.1) and (2.2) with α=√2, the error bound C is still a messy function, this time of the parameter r (from which s is computable: s=(α−1)/r or s=(α⁻¹−1)/r). Note that we may assume r>0, since negating r and s yields another valid factorization without changing the norms of the factors. As previously mentioned, a plausible choice of r is r=√(2−√2)≈0.7653668647301795 for (2.1) and r=√(√2−1)≈0.6435942529055826 for (2.2); these yield error bounds of about 3.4167456510765178 and 3.1662988562277977, respectively. Numerical optimization on r yields the following values:

[0063] for (2.1): r≈0.4811840837330633, C≈3.2403445213547604;

[0064] for (2.2): r≈0.8035642481527317, C≈3.1303571980433588.

[0065] Using more detailed calculations, we can get error estimates more precise than those obtained from norms of matrices. Look at a particular elementary factor matrix A_(i). If the nonzero off-diagonal entry of A_(i) is in the upper right, and if φ_(i)x is the result of rounding A_(i)x to integer coordinates, then for any integer vector x_(i) the error vector φ_(i)x_(i)−A_(i)x_(i) will have the form (e_(i), 0), where |e_(i)|≦½. Combining this for all factors in a product A=A₁A₂ ⋯ A_(k), we get

$d = \phi_1\phi_2\cdots\phi_kx - Ax = d_1 + A_1d_2 + A_1A_2d_3 + \cdots + A_1A_2\cdots A_{k-1}d_k,$

[0066] where

$d_i = \phi_i\phi_{i+1}\phi_{i+2}\cdots\phi_kx - A_i\phi_{i+1}\phi_{i+2}\cdots\phi_kx$

[0067] is of the form (e_(i), 0) if A_(i) is elementary with upper right entry nonzero, d_(i)=(0, e_(i)) if A_(i) is elementary with lower left entry nonzero, and d_(i) is the zero vector (i.e., term number i in the error bound is omitted) if A_(i) is an integer matrix. We can now find the maximum possible value of ∥d∥ subject to the constraint that |e_(i)|≦½ for all i, and this will give an error bound for the integer approximation of A.

[0068] For the factorization (2.3), we get
$d = \begin{pmatrix}e_2 + e_3\alpha\\ -e_3 + e_4\alpha^{-1}\end{pmatrix}.$

[0069] Clearly we maximize ∥d∥ by letting e₂, e₃, e₄ all have absolute value ½, with e₂ and e₃ having the same sign and e₄ having the opposite sign. This gives the error bound
$\|d\| \leq \frac{1}{2}\left(1 + \alpha^{-1}\right)\sqrt{1 + \alpha^2}.$

[0070] Because of the known relation between (2.3) and (2.4), we get the error bound for (2.4) by replacing α with α⁻¹ in the error bound for (2.3):
$\|d\| \leq \frac{1}{2}\left(1 + \alpha\right)\sqrt{1 + \alpha^{-2}}.$

[0071] These two bounds are actually equal. In the case α=√2, the common value of the bound is
$\frac{1}{4}\sqrt{3}\left(2 + \sqrt{2}\right) \approx 1.4783978394802332.$

[0072] For (2.1) we have
$d = \begin{pmatrix}e_1 + e_2r + e_3\alpha\\ e_2 + e_3s + e_4\alpha^{-1}\end{pmatrix}.$

[0073] This leads to case distinctions based on the signs of α−1, r, and s. As before, we may assume that r>0; this means that α−1 and s have the same sign, since rs=α−1.

[0074] If α>1, and hence s>0, then clearly ∥d∥ is maximized when the errors e_(i) are all ½ or all −½. So the error bound is:
$\|d\| \leq \frac{\sqrt{\left(r\alpha(r + \alpha + 1)\right)^2 + \left(r(\alpha + 1) + \alpha(\alpha - 1)\right)^2}}{2r\alpha}.$

[0075] One can actually work out the critical points of this function of r (holding α fixed) by solving a fourth-degree polynomial equation, but it is probably more convenient to find the optimal value of r numerically.
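
For instance, a crude grid search suffices. The sketch below is an editorial illustration (Python with NumPy assumed; the grid endpoints and resolution are arbitrary) that minimizes the bound just displayed for α=√2:

```python
import numpy as np

alpha = np.sqrt(2.0)

def bound_21(r):
    # ||d|| bound for (2.1) when alpha > 1, as displayed above
    return np.hypot(r * alpha * (r + alpha + 1),
                    r * (alpha + 1) + alpha * (alpha - 1)) / (2 * r * alpha)

r_grid = np.linspace(0.01, 2.0, 200_001)
vals = bound_21(r_grid)
i = vals.argmin()
print(r_grid[i], vals[i])   # about r = 0.5790, bound = 1.9253 (cf. [0078])
```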

[0076] If α<1 (so s<0), then the choice of signs for the errors e_(i) is much less clear; aligning them to make one component of d maximal will cause cancellation in the other component. One must consider the various possibilities to see which yields the longest error vector for given values of α and r. (It is easy to see that each e_(i) should be ±½, because the most distant point from the origin on a line segment is always one of the two endpoints; only the signs of the numbers e_(i) are unknown.)

[0077] The situation for (2.2) is reversed: for α<1 one can get a single formula for the maximal error, but for α>1 one must look at various cases.

[0078] Again consider the example α=√2. For (2.1) we have a single error formula and can proceed directly to numerical optimization to find that the best value for r is about 0.5789965414556075, giving an error bound of about 1.9253467944884184. For (2.2), the error bound is the maximum of four separate formulas; it turns out that this is minimized where two of the formulas cross each other, at r=√(2√2−2)/2≈0.4550898605622273, and the error bound is √(12+18√2)/4≈1.5300294956861884.

[0079] We still have to consider special values of r and s where one of the four matrices in (2.1) or (2.2) is integral, and hence the corresponding e_(i) becomes 0. Among these are several cases giving an error bound matching the value
$\frac{1}{4}\sqrt{3}\left(2 + \sqrt{2}\right)$

[0080] from (2.3), and two cases which give even better bounds: putting r=√2 in (2.2) gives an error bound of √(21+8√2)/4≈1.4211286997265756, and putting r=1−1/√2 in (2.2) gives an error bound of √(6+√2)/2≈1.3614526765897057.

[0081] Even this does not exhaust the error analysis for (2.1)-(2.4). The error bounds obtained above are not sharp, because the errors e_(i) are not actually independent of each other. For instance, in the computation for (2.3), e₂ is not independent of e₃ and e₄: one can show that e₂+αe₃−e₄ must be an integer. (If we start with an integer vector x=(x₁, x₂), then the second component of φ₄x is b=x₂+x₁α+e₄ and the second component of φ₂φ₃φ₄x is b′=x₁α+e₃α+e₂; these are both integers, so b′−b+x₂=e₂+αe₃−e₄ is an integer.) Using this, we can get the following error bound:
$\|d\| \leq \begin{cases}\frac{1}{2}\sqrt{(\alpha + 1)^2 + \alpha^{-2}\left(2\lceil\alpha/2\rceil - 1\right)^2} & \text{if } \alpha > 1,\\ \frac{1}{2}\sqrt{(\alpha^{-1} + 1)^2 + \left(2\lceil\alpha/2\rceil - 1\right)^2} & \text{if } \alpha < 1.\end{cases}$

[0082] Replace α with α⁻¹ to get the corresponding error bound for (2.4). For α=√2 the error bounds for (2.3) and (2.4) are √(14+8√2)/4≈1.2578182623839374 and √(4+2√2)/2≈1.3065629648763765, respectively.

[0083] Such improvements for (2.1) and (2.2) are available only for special values of r and s, and depend highly on the specific form of these numbers and of α. For instance, in (2.2) for α=√2 and r=1−1/√2, this method gives an error bound of √(4+2√2)/2, the same as for (2.4) above.

[0084] These final error bounds for (2.3) and (2.4) are provably sharp when α is irrational, as is the above bound for the instance r=1−1/√2 of (2.2). So (2.3) appears to give the best results among these methods when α=√2.

[0085] If α is rational (and, in the case of (2.1) and (2.2), the parameters r and s are also rational), then the errors from the integer approximation are periodic in both coordinates, so one can perform a finite computation to get the exact error bound for a particular factorization.

[0086] One can obtain other integer approximation methods for rational α by constructing a finite configuration on a rectangle in the plane and extending ‘periodically’ to get the full mapping. Even for irrational α, where a ‘periodic’ solution is not available, one can still use integer approximation methods completely different from those obtained via factorization into elementary matrices.

Larger Matrices

[0087] As noted before, one can use Gaussian elimination to factor an n×n matrix of determinant 1 into elementary matrices, permutation matrices (or ‘permutation’ matrices of determinant 1), and a diagonal matrix of determinant ±1; we may assume that the determinant of the diagonal matrix is 1, because we can transfer a negation to one of the ‘permutation’ factors. A diagonal matrix of determinant 1 can be factored into simpler diagonal matrices which have only two entries different from 1, these entries being reciprocals of each other; these simpler matrices can then be factored as in (2.1)-(2.4). So we know that any matrix of determinant 1 can be factored into integer-approximable factors.

[0088] But this process would yield a very large number of factors. The number of factors can be drastically reduced if we work with a family of factor matrices more general than the elementary matrices but still allowing easy bounded-error integer approximations. The matrices we will use here are the unit triangular matrices, which are (lower or upper) triangular matrices whose diagonal entries are all 1. (Note that any elementary matrix is unit triangular.) FIG. 2 illustrates the process followed to generate the elementary matrices, permutation matrices and diagonal matrix used in the present invention.

[0089] Suppose we have a unit upper triangular matrix
$U = \begin{pmatrix}1 & a_{12} & a_{13} & \cdots & a_{1n}\\ 0 & 1 & a_{23} & \cdots & a_{2n}\\ 0 & 0 & 1 & \cdots & a_{3n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & 1\end{pmatrix},$

[0090] so that
$U\begin{pmatrix}x_1\\ x_2\\ x_3\\ \vdots\\ x_n\end{pmatrix} = \begin{pmatrix}x_1 + a_{12}x_2 + a_{13}x_3 + \cdots + a_{1n}x_n\\ x_2 + a_{23}x_3 + \cdots + a_{2n}x_n\\ x_3 + \cdots + a_{3n}x_n\\ \vdots\\ x_n\end{pmatrix}.$

[0091] Then we can approximate U by
$\phi\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_n\end{pmatrix} = \begin{pmatrix}x_1 + \langle a_{12}x_2 + a_{13}x_3 + \cdots + a_{1n}x_n\rangle\\ x_2 + \langle a_{23}x_3 + \cdots + a_{2n}x_n\rangle\\ \vdots\\ x_n\end{pmatrix}.$

[0092] This will give $\|Ux - \phi x\| \leq \sqrt{n-1}/2$ for all integer vectors x. And φ is invertible:

[0093] if
$\phi\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_n\end{pmatrix} = \begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix},$

[0094] then
$x_n = y_n,$
$x_{n-1} = y_{n-1} - \langle a_{n-1,n}x_n\rangle,$
$x_{n-2} = y_{n-2} - \langle a_{n-2,n-1}x_{n-1} + a_{n-2,n}x_n\rangle,$
  ⋮
$x_1 = y_1 - \langle a_{12}x_2 + \cdots + a_{1n}x_n\rangle.$

[0095] Note that φ can be computed in place (output entries overwriting input entries) if the output entries are computed in the order y₁, y₂, . . . , y_(n); φ⁻¹ can also be computed in place, with the results computed downward from x_(n) to x₁.
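
The forward and inverse mappings just described can be sketched as follows. This is an editorial illustration in Python, not part of the original disclosure; a[i][j] holds the strictly-upper entries with 0-based indices, and round_half_up implements ⟨·⟩:

```python
import math

def round_half_up(y):
    return math.floor(y + 0.5)

def phi_unit_upper(a, x):
    """In-place integer approximation of a unit upper triangular matrix;
    the outputs are computed in the order y_1, y_2, ..., y_n."""
    n = len(x)
    for i in range(n):
        x[i] += round_half_up(sum(a[i][j] * x[j] for j in range(i + 1, n)))
    return x

def phi_unit_upper_inv(a, y):
    """In-place inverse, computed downward from x_n to x_1."""
    n = len(y)
    for i in range(n - 1, -1, -1):
        y[i] -= round_half_up(sum(a[i][j] * y[j] for j in range(i + 1, n)))
    return y
```

Both routines overwrite their argument in place, in the orders noted above, and phi_unit_upper_inv undoes phi_unit_upper exactly on integer vectors, because each step subtracts the rounding of exactly the quantity that was added.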

[0096] Again we find that φ⁻¹ is not the same as the integer approximation to the matrix U⁻¹ (which is also unit upper triangular). The difference here is more substantial than in the 2×2 case. For instance, if n=3 and we apply the approximation for U and then the approximation for U⁻¹ to the starting integer vector (x₁, x₂, x₃), the first coordinate of the result will be

$x_1 + \langle a_{12}x_2 + a_{13}x_3\rangle + \langle -a_{12}x_2 - a_{13}x_3 + a_{12}(a_{23}x_3 - \langle a_{23}x_3\rangle)\rangle,$

[0097] which is quite likely to be different from x₁ even without boundary effects in the rounding rule. (In fact, the recursion displayed above in the computation of φ⁻¹ tends to result in larger error bounds for the inverse transform than for the forward transform.)

[0098] The same approximation method works for a unit lower triangular matrix; one merely has to compute the output coordinates in the reverse order. (Again, one can convert between upper triangular and lower triangular using conjugation by a reverse-diagonal matrix J.) Actually, there are variants where the coordinates are computed in any specified order; these are obtained by combining a unit triangular matrix with a ‘permutation’ matrix.

[0099] Since a general matrix of determinant 1 can be factored into elementary matrices, it certainly can be factored into unit triangular matrices. The main question now is how many unit triangular factors are required in general. A quick lower bound can be obtained by counting degrees of freedom (free parameters). The general matrix of determinant 1 has n²−1 degrees of freedom. A unit triangular matrix has (n²−n)/2 degrees of freedom; hence, at least three such factors are needed to handle the general matrix (assuming n>1).

[0100] Note that a product of unit upper triangular matrices is unit upper triangular, and a product of unit lower triangular matrices is unit lower triangular. So, in a factorization into unit triangular matrices, we may assume that the two types of matrix alternate.

[0101] We just saw that a product of two unit triangular matrices is not general enough to give an arbitrary matrix of determinant 1. It turns out that the family of matrices that can be expressed as a product of two unit triangular matrices, say in the order LU, is an interesting one. One of ordinary skill in the art would be capable of expressing matrices as a product of two unit triangular matrices, since such results are related to the method of expressing the diagonal entries of the upper triangular factor in a standard LU-decomposition as quotients of determinants of leading square submatrices of the original matrix.

[0102] Proposition 3.1.

[0103] An n×n matrix A=(a_(ij))_(n,n) can be expressed in the form LU (L unit lower triangular, U unit upper triangular) if and only if, for each k≦n, the leading k×k submatrix of A (i.e., the upper left k×k submatrix of A, or (a_(ij))_(k,k), or A↾k×k) has determinant 1.

[0104] Proof.

[0105] (⇒) It is easy to see from the special forms of L and U that (LU)↾k×k = (L↾k×k)(U↾k×k) for any k≦n. Since L↾k×k and U↾k×k obviously have determinant 1, (LU)↾k×k must have determinant 1.

[0106] (⇐) Suppose A has the specified leading-submatrix property. If we express the unknown L and U as (b_(ij))_(n,n) and (c_(ij))_(n,n) (so b_(ii)=c_(ii)=1, b_(ij)=0 for i<j, and c_(ij)=0 for i>j), then LU works out to be
$\begin{pmatrix}1 & c_{12} & c_{13} & c_{14} & \cdots\\ b_{21} & 1 + b_{21}c_{12} & c_{23} + b_{21}c_{13} & c_{24} + b_{21}c_{14} & \cdots\\ b_{31} & b_{32} + b_{31}c_{12} & 1 + b_{32}c_{23} + b_{31}c_{13} & c_{34} + b_{32}c_{24} + b_{31}c_{14} & \cdots\\ b_{41} & b_{42} + b_{41}c_{12} & b_{43} + b_{42}c_{23} + b_{41}c_{13} & 1 + b_{43}c_{34} + b_{42}c_{24} + b_{41}c_{14} & \cdots\\ \vdots & \vdots & \vdots & \vdots & \ddots\end{pmatrix}.$

[0107] So we can set b_(i1) and c_(1j) so that the entries of LU in the first column (below the diagonal) and the first row (right of the diagonal) will match the corresponding entries of A. Then we can set b_(i2) and c_(2j) so that the remaining off-diagonal entries in the second row and column of LU match those of A. Continuing this way, we obtain matrices L and U of the required form so that all off-diagonal entries of LU match the corresponding entries of A.

[0108] Using the fact that A and LU both have the property that all leading square submatrices have determinant 1, we now show by induction on k that the k'th diagonal entry of LU is a_(kk). Suppose this is true for all k′<k. Then A↾k×k and (LU)↾k×k agree except possibly at the lower right entry. If we treat a_(kk) as an unknown, then the equation det(A↾k×k)=1 is a linear equation for a_(kk) and the coefficient of a_(kk) is det(A↾(k−1)×(k−1))=1, so a_(kk) is uniquely determined. The lower right entry of (LU)↾k×k satisfies exactly the same equation, so it must be equal to a_(kk).

[0109] Therefore, A=LU, as desired.

[0110] The right-to-left direction could be proved more briefly as follows: given the matrix A, perform the standard LU-decomposition (no pivoting is needed) to write A as a product of a unit lower triangular matrix and a general upper triangular matrix; then, from the above, the diagonal entries of the second factor will all be 1. A more explicit proof was presented above to show the simplifications that arise in this special case.

[0111] In fact, given a suitable matrix A=(a_(ij))_(n,n), there is a quite simple algorithm to compute matrices L=(b_(ij))_(n,n) and U=(c_(ij))_(n,n) as above. Start by setting x_(ij)←a_(ij) for all i, j≦n, and do:
for k = 1 to n−1: for i = k+1 to n: for j = k+1 to n: x_(ij) ← x_(ij) − x_(ik)x_(kj)  (3.1)

[0112] Then we will have x_(ij)=b_(ij) for i>j and x_(ij)=c_(ij) for i<j (and x_(ii)=1).
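
Algorithm (3.1) translates almost verbatim into code. The following sketch is an editorial illustration in plain Python (0-based indices), not part of the original disclosure:

```python
def lu_unit_split(a):
    """Algorithm (3.1): given a matrix whose leading square submatrices
    all have determinant 1, return an array whose strictly lower part
    holds the entries b_ij of L and whose strictly upper part holds the
    entries c_ij of U, with A = LU and unit diagonals."""
    n = len(a)
    x = [row[:] for row in a]          # work on a copy of A
    for k in range(n - 1):
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                x[i][j] -= x[i][k] * x[k][j]
    return x

# Example: A = [[1, 2], [3, 7]] has leading determinants 1 and 1;
# the split gives b_21 = 3 and c_12 = 2, i.e. L = [[1, 0], [3, 1]],
# U = [[1, 2], [0, 1]], and indeed L @ U = A.
```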

[0113] By reversing the indices both horizontally and vertically, we see that a matrix can be written in the form UL (with L and U as above) if and only if all of its lower right square submatrices have determinant 1.

[0114] To handle more general matrices of determinant 1, we need more than two factors. It turns out that three factors (along with a possible ‘permutation’) will always suffice. In fact, since three factors give more degrees of freedom than we need, we can be more specific by requiring one of the three unit triangular factors to have a special form.

[0115] Proposition 3.2.

[0116] Let A=(a_(ij))_(n,n) be an n×n matrix of determinant 1 such that all of the submatrices (a_(i+1,j))_(k,k) for 1≦k≦n−1 have nonzero determinant. Then A can be written in the form U₁LU where U₁ and U are unit upper triangular, L is unit lower triangular, and the only nonzero entries of U₁ are on the diagonal or the top row.

[0117] Proof.

[0118] We first find a matrix A′ differing from A only in the first row so that all leading square submatrices of A′ have determinant 1. Let a′₁₁, a′₁₂, . . . , a′_(1n) denote the unknown entries of the first row of A′. Then, once we know a′₁₁, a′₁₂, . . . , a′_(1,k−1), we can determine a′_(1k) so that A′↾k×k will have determinant 1. This condition is a linear equation for a′_(1k) and the coefficient of the linear term is ±det (a_(i+1,j))_(k−1,k−1) (define this to be 1 for k=1), which is nonzero by assumption, so there is a (unique) value which works for a′_(1k). So we can proceed from left to right to determine all of the unknown entries of A′.

[0119] If we can find a matrix U₁ of the required form so that A=U₁A′, then we can use Proposition 3.1 to express A′ in the form LU for unit triangular matrices L and U, giving A=U₁LU as desired. Let 1, u₂, u₃, . . . , u_(n) denote the entries of the first row of U₁. Then the unknown values u_(i) must satisfy the equations
$a_{21}u_2 + a_{31}u_3 + \cdots + a_{n1}u_n = a_{11} - a'_{11},$
$a_{22}u_2 + a_{32}u_3 + \cdots + a_{n2}u_n = a_{12} - a'_{12},$
  ⋮
$a_{2n}u_2 + a_{3n}u_3 + \cdots + a_{nn}u_n = a_{1n} - a'_{1n}.$

[0120] If we just look at the first n−1 of these equations, then the (n−1)×(n−1) matrix of coefficients is (a_(i+1,j))_(n−1,n−1)^T, which has nonzero determinant, so there are unique numbers u₂, . . . , u_(n) satisfying these n−1 equations. This means that the resulting matrix U₁ will be such that U₁A′ agrees with A everywhere except possibly at the upper right corner entry. But A and U₁A′ both have determinant 1, and the cofactor of the upper right corner in the computation of these determinants is det (a_(i+1,j))_(n−1,n−1)≠0, so A and U₁A′ must in fact agree everywhere.

[0121] The special case n=2 of this proposition is of interest; it states that any 2×2 matrix
$A = \begin{pmatrix}a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix}$

[0122] of determinant 1 with a₂₁≠0 can be written as a product of three elementary matrices. It is not hard to work out this factorization explicitly:
$A = \begin{pmatrix}1 & \frac{a_{11}-1}{a_{21}}\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0\\ a_{21} & 1\end{pmatrix}\begin{pmatrix}1 & \frac{a_{22}-1}{a_{21}}\\ 0 & 1\end{pmatrix}. \quad (3.2)$

[0123] By transposing everything and reversing the order of the factors, we get a similar factorization for a 2×2 matrix of determinant 1 with upper right entry nonzero. So it is only the diagonal matrices which require four factors, as discussed above in the section devoted to the 2×2 diagonal matrix. (It is not hard to show that a non-identity 2×2 diagonal matrix cannot be written as a product of elementary factors without using at least two upper factors and two lower factors; for instance, if only one lower factor is used, then the nonzero lower left entry of this factor will be the same as the lower left entry of the product.)
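
As a concrete check, factorization (3.2) can be sketched as follows (an editorial illustration assuming NumPy; the sample matrix is arbitrary):

```python
import numpy as np

def factor_2x2_three_elementary(A):
    """Factorization (3.2) of a 2x2 determinant-1 matrix with a21 != 0."""
    (a11, _), (a21, a22) = A
    assert a21 != 0
    E1 = np.array([[1.0, (a11 - 1) / a21], [0.0, 1.0]])
    E2 = np.array([[1.0, 0.0], [a21, 1.0]])
    E3 = np.array([[1.0, (a22 - 1) / a21], [0.0, 1.0]])
    return E1, E2, E3

A = np.array([[2.0, 3.0], [1.0, 2.0]])      # det = 1, lower left entry nonzero
E1, E2, E3 = factor_2x2_three_elementary(A)
assert np.allclose(E1 @ E2 @ E3, A)
```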

[0124] A given matrix of determinant 1 might not satisfy the hypothesis of Proposition 3.2, but this problem can be handled by a small modification of the matrix. Given any nonsingular matrix, one can permute the rows so as to get all leading square submatrices to have nonzero determinant. (Expand the determinant of the matrix by minors on the last column; since the determinant is nonzero, one of these minors has nonzero determinant. So we can swap rows so that the leading (n−1)×(n−1) submatrix has nonzero determinant. Now proceed recursively.) Then we can move the last row up to the top (and negate it if necessary to restore the determinant to +1) to get a matrix satisfying the hypotheses of Proposition 3.2. Therefore:

[0125] Theorem 3.3.

[0126] Any matrix of determinant 1 can be factored in the form ΠU₁LU, where Π is a signed permutation matrix, U₁ and U are unit upper triangular, L is unit lower triangular, and the only nonzero entries of U₁ are on the diagonal or the top row.

[0127] There is a version of Theorem 3.3 which applies to any nonsingular matrix A: one can factor A in the form $\Pi\bar{U}_1LU$ where Π is a ‘permutation’ matrix, L is unit lower triangular, U is unit upper triangular, and $\bar{U}_1$ is a matrix which differs from the identity matrix only in its first row. (So $\bar{U}_1$ is like U₁ except that the upper left entry of $\bar{U}_1$ may differ from 1.) To see this, first note that the argument just before Theorem 3.3 applies here to find a ‘permutation’ Π such that A=Π$\bar{A}$ where $\bar{A}$ is such that the submatrices specified in Proposition 3.2 have nonzero determinant. Let d be the determinant of $\bar{A}$, and let A′ be $\bar{A}$ with its first row divided by d. Apply Proposition 3.2 to factor A′ in the form U₁LU, and let $\bar{U}_1$ be U₁ with its first row multiplied by d; then we have A=Π$\bar{U}_1$LU as desired.

[0128] By tracing through the proofs leading up to Theorem 3.3, one can extract an algorithm for factoring a matrix of determinant 1 in the specified form, but it will not be a good algorithm. In particular, instead of using trial and error with subdeterminants to choose a ‘permutation’ Π, we would like to have a method 300 that works faster and produces more numerically stable results. It turns out that the standard LU-decomposition algorithm provides just what is needed.

[0129] FIG. 3 illustrates a method 300 for generating matrix factors for a given matrix A of determinant 1. Gaussian elimination can be performed on this matrix using elementary operations and permutations on the rows only (partial pivoting) to reduce the matrix to upper triangular form (step 310). This means that we get the equation $\tilde{\Pi}A = \tilde{L}\tilde{V}$, where $\tilde{\Pi}$ is a permutation, $\tilde{L}$ is unit lower triangular, and $\tilde{V}$ is upper triangular (step 320). (The matrix $\tilde{V}$ could be factored further into a diagonal matrix $\tilde{D}$ and a unit upper triangular matrix $\tilde{U}$, but we will not need that here.) Note that $(\tilde{L}\tilde{V})$↾k×k = ($\tilde{L}$↾k×k)($\tilde{V}$↾k×k) and the latter two matrices have nonzero determinant: the determinant of $\tilde{V}$↾k×k is the product of its diagonal entries, which is nonzero because the product of all of the diagonal entries of $\tilde{V}$ is det $\tilde{V}$ = (det($\tilde{\Pi}A$))/(det $\tilde{L}$) = ±1.

[0130] So now we can take $\tilde{\Pi}A$, transfer the bottom row to the top, and negate this row if necessary so that the resulting matrix Â will have determinant 1; then Â is a ‘permuted’ version of A (step 350) which satisfies the hypotheses of Proposition 3.2.

[0131] Let σ = det $\tilde{\Pi}$ be the sign of the permutation given by $\tilde{\Pi}$. Then the top row of Â is the bottom row of $\tilde{\Pi}A$ multiplied by (−1)^(n+1)σ, and we have Â = $\hat{\Pi}$A, where $\hat{\Pi}$ is $\tilde{\Pi}$ with its bottom row moved to the top and multiplied by (−1)^(n+1)σ (step 340). Now we can write A = ΠÂ, where Π = $\hat{\Pi}^{-1}$ = $\hat{\Pi}^T$. Note that $\hat{\Pi}$ and Π are ‘permutation’ matrices.

[0132] Once we have the Proposition 3.2 decomposition U₁LU of Â, we will have factored A into the form ΠU₁LU, as desired. We will now see that knowing the decomposition $\tilde{L}\tilde{V}$ of $\tilde{\Pi}A$ makes it much easier to compute the matrices U₁, L, and U.

[0133] The matrix Â has the form
$\begin{pmatrix}\hat{a}_{11} & \hat{a}_{12} & \hat{a}_{13} & \cdots & \hat{a}_{1n}\\ & & (\tilde{\Pi}A)\upharpoonright(n-1){\times}n & & \end{pmatrix},$

[0134] where the numbers â_(1i) are the bottom row of $\tilde{\Pi}A$ (possibly negated) (step 330). The modified matrix A′ from the proof of Proposition 3.2 will have the form
$\begin{pmatrix}a'_{11} & a'_{12} & a'_{13} & \cdots & a'_{1n}\\ & & (\tilde{\Pi}A)\upharpoonright(n-1){\times}n & & \end{pmatrix},$

[0135] where the numbers a′_(1i) are to be chosen so that the leading square submatrices of A′ all have determinant 1 (step 350).

[0136] One obtains $\tilde{V}$ from $\tilde{\Pi}A$ by performing elementary row operations as specified by $\tilde{L}$; each of these operations adds a multiple of an earlier row to a later row. If one performs the same operations (shifted down one row) on the lower n−1 rows of the matrix A′, one obtains the matrix
$A'' = \begin{pmatrix}a'_{11} & a'_{12} & a'_{13} & \cdots & a'_{1n}\\ & & \tilde{V}\upharpoonright(n-1){\times}n & & \end{pmatrix}.$

[0137] Since these operations again only add multiples of earlier rows to later rows, they are still valid row operations when restricted to any leading square submatrix of the matrix, so they do not change the determinants of these submatrices. So if we find values for a′_(1i) so that the leading square submatrices of A″ all have determinant 1, then the leading square submatrices of A′ will also have determinant 1.

[0138] Let v_(ij) for 1≦i, j≦n be the entries of the matrix $\tilde{V}$; then v_(ij)=0 for i>j. Also, let $\bar{d}_k = v_{11}v_{22}\cdots v_{kk}$ (the product of the first k diagonal entries of $\tilde{V}$; this is equal to the determinant of the leading k×k submatrix of $\tilde{V}$ or of $\tilde{\Pi}A$). Then one can derive the following formula for the desired values a′_(1k):
$a'_{1k} = \sum_{i=1}^{k}(-1)^{i+1}\frac{v_{ik}}{\bar{d}_i}. \quad (3.3)$

[0139] This can be re-expressed in various ways: since $v_{kk}/\bar{d}_k = 1/\bar{d}_{k-1}$, one can combine the last two terms into $\pm(v_{k-1,k}-1)/\bar{d}_{k-1}$ and, instead of computing the products $\bar{d}_k$, one can write the formula in Horner form:
$a'_{1k} = \frac{1}{v_{11}}\left(v_{1k} - \frac{1}{v_{22}}\left(v_{2k} - \frac{1}{v_{33}}\left(v_{3k} - \cdots\right)\right)\right).$
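
Both forms can be cross-checked mechanically. The sketch below is an editorial illustration (Python with NumPy assumed; V is an arbitrary upper triangular test matrix standing in for $\tilde{V}$, and k is 0-based):

```python
import numpy as np

def a_prime_sum(V, k):
    """Formula (3.3): sum_{i=1}^{k} (-1)^(i+1) v_ik / d_i,
    with d_i = v_11 v_22 ... v_ii."""
    total, d = 0.0, 1.0
    for i in range(k + 1):
        d *= V[i, i]
        total += (-1) ** i * V[i, k] / d
    return total

def a_prime_horner(V, k):
    """The same value in the Horner form displayed above."""
    acc = 0.0
    for i in range(k, -1, -1):
        acc = (V[i, k] - acc) / V[i, i]
    return acc

V = np.array([[2.0, 3.0, 1.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
assert all(np.isclose(a_prime_sum(V, k), a_prime_horner(V, k)) for k in range(3))
```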

[0140] Once we have A′, it is easy to factor it into the form LU, as described earlier. It now remains to find the matrix U₁ so that Â=U₁A′. As noted in the proof of Proposition 3.2, this requires solving a system of n−1 linear equations in n−1 unknowns u₂, u₃, . . . , u_(n), and the matrix of coefficients for this system is the transpose of the lower left (n−1)×(n−1) submatrix of Â (step 360). But this is just $((\tilde{\Pi}A)\upharpoonright(n-1){\times}(n-1))^T$, and the known factorization of $\tilde{\Pi}A$ into two triangular matrices immediately gives such a factorization for this matrix:
$((\tilde{\Pi}A)\upharpoonright(n-1){\times}(n-1))^T = (\tilde{V}\upharpoonright(n-1){\times}(n-1))^T\,(\tilde{L}\upharpoonright(n-1){\times}(n-1))^T.$

[0141] Using this, we can easily solve for the unknown values u₂, . . . , u_(n), thus completing our desired factorization of A.

[0142] In summary, the method 300 for factoring the determinant-1 matrix A into the form ΠU₁LU is:

[0143] (1) Use a standard LU-decomposition algorithm (Gaussian elimination with partial pivoting) to find $\tilde{\Pi}$, $\tilde{L}$, and $\tilde{V}$ so that $\tilde{\Pi}A = \tilde{L}\tilde{V}$. Keep track of the number k of row interchanges performed during this process, and let $\hat{\sigma} = (-1)^{n+1+k}$.

[0144] (2) Compute $\tilde{\Pi}A$. (Perhaps this will be done during step (1).)

[0145] (3) Multiply the (unique nonzero entry in the) bottom row of $\tilde{\Pi}$ by $\hat{\sigma}$, move this bottom row up to the top (moving all the other rows down by 1), and take the transpose (i.e., invert the permutation) to get Π.

[0146] (4) Let â be the bottom row of {tilde over (Π)}A multiplied by{circumflex over (σ)}.

[0147] (5) Compute the numbers a′₁₁, a′₁₂, . . . , a′_(1n) from {tilde over (DU)} according to formula (3.3), and let a′ be the row vector (a′₁₁, a′₁₂, . . . , a′_(1n)).

[0148] (6) Using standard backsolving techniques for triangular matrices (but reversed), find the row vector u satisfying the equation u{tilde over (L)}{tilde over (DU)}=â−a′. Let U₁ be an n×n identity matrix with the second through n'th entries in its first row replaced by the first through (n−1)'th entries of u.

[0149] (7) Form the matrix A′ consisting of the row a′ followed by the first n−1 rows of {tilde over (Π)}A (step 370). Apply (3.1) to A′ to compute the entries of the matrices L and U.

[0150] Note that the last entry in the row vectors â and u will not be used and need not be computed. Also note that the nontrivial numbers in the matrices U₁, L, and U can easily be packed into a single n×n matrix (with one number to spare).
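The seven steps can be prototyped numerically. The sketch below (Python with numpy and scipy; all names are illustrative, and a generic linear solve stands in for the reversed triangular backsolve of step (6)) carries out steps (1)-(7) and checks that the product ΠU₁LU reproduces A:

```python
import numpy as np
from scipy.linalg import lu

def factor_pi_u1_l_u(A):
    """Factor a determinant-1 matrix A as Pi @ U1 @ L @ U (steps (1)-(7))."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Step (1): A = P @ Ltil @ V with partial pivoting, so Pitil = P^T
    P, Ltil, V = lu(A)
    Pitil = P.T
    PitilA = Pitil @ A
    # sigma-hat = (-1)^(n+1+k); (-1)^k is the sign of the permutation
    sigma = (-1) ** (n + 1) * round(np.linalg.det(P))
    # Steps (3) and (4): build Pi and the row a-hat
    Pihat = np.vstack([sigma * Pitil[-1:], Pitil[:-1]])
    Pi = Pihat.T
    ahat = sigma * PitilA[-1]
    # Step (5): the row a' from formula (3.3), in Horner form
    aprime = np.empty(n)
    for j in range(n):
        s = 1.0
        for i in range(j - 1, -1, -1):
            s = (V[i, j] - s) / V[i, i]
        aprime[j] = s
    # Step (6): solve for u (generic solve instead of triangular backsolving)
    M = PitilA[:n - 1, :n - 1]
    u = np.linalg.solve(M.T, (ahat - aprime)[:n - 1])
    U1 = np.eye(n)
    U1[0, 1:] = u
    # Step (7): A' = row a' on top of the first n-1 rows of Pitil @ A,
    # then an unpivoted LU; the leading minors of A' are 1 by construction
    Ap = np.vstack([aprime, PitilA[:n - 1]])
    L, U = np.eye(n), Ap.copy()
    for j in range(n):
        for i in range(j + 1, n):
            L[i, j] = U[i, j] / U[j, j]
            U[i] -= L[i, j] * U[j]
    return Pi, U1, L, U

A = np.array([[2., 3., 1.], [1., 2., 1.], [1., 1., 1.]])  # det A = 1
Pi, U1, L, U = factor_pi_u1_l_u(A)
assert np.allclose(Pi @ U1 @ L @ U, A)
```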

[0151] The form ΠU₁LU is only one possible form for a factorization of a given matrix A of determinant 1 (step 380); there are many other options for factoring A into unit triangular matrices and a ‘permutation’ matrix. For instance, one could factor A in the form ΠL_(n)UL, where L_(n) is unit lower triangular with nonzero off-diagonal entries only on the n'th row. (To get this, reverse the coordinates of A, factor in the form ΠU₁LU, and reverse again. In other words, conjugate by the reverse-diagonal matrix J.) Another possibility is the form LUL¹Π, where L¹ is unit lower triangular with its nonzero entries in the first column. (Transpose A, factor in the form ΠU₁LU, and transpose again, reversing the order of factors.) Yet another possibility is to use full pivoting rather than partial pivoting in the initial LU-decomposition, leading to a factorization A=ΠU₁LUΠ₂ with two ‘permutation’ matrices.

[0152] The form U₁ is particularly suitable for integer approximation purposes, because the integer approximation for this factor requires only one coordinate to be rounded; thus, the error bound for this factor is ½, as opposed to $\sqrt{n-1}/2$ for the general unit triangular matrix. The same applies to the form L_(n), but not to the form L¹.

[0153] But it is good to have as many options as possible, so one can look for a factorization that gives the best error bounds for the integer approximation as a whole, just as described in the section entitled “The 2×2 Diagonal Matrix.”

Preserving Particular Lattice Points

[0154] In some cases the linear transformation A already sends some integer lattice points to integer lattice points. This may be a fundamental property of the transformation, in which case it will be highly desirable to have the approximating integer map φ match A exactly on these particular points. An example of this situation is presented in the next section.

[0155] One particular case of this is handled automatically by the factorization shown in the section entitled “Larger Matrices.” Suppose that we have Ae₁=e₁, where e₁ is the elementary vector with first entry 1 and remaining entries 0. This is equivalent to A having first column equal to e₁. Then we have A(ke₁)=ke₁ for all integers k, and we would like to have φ(ke₁)=ke₁ also. This turns out to be the case:

[0156] Proposition 4.1.

[0157] Any matrix A of determinant 1 such that Ae₁=e₁ can be factored in the form ΠU₁LU, where Π is a signed permutation matrix, U₁ and U are unit upper triangular, L is unit lower triangular, the only nonzero entries of U₁ are on the diagonal or the top row, and the integer approximation φ to A resulting from this factorization satisfies φ(ke₁)=ke₁ for all integers k.

[0158] Proof.

[0159] Follow the algorithm shown in the section entitled “Larger Matrices.” The first step is to use Gaussian elimination with partial pivoting to obtain the expression {tilde over (Π)}A={tilde over (L)}{tilde over (DU)}. But the initial matrix A already has its first column in the desired form, so the elimination will leave the first row alone and process the remaining rows in order to handle columns 2 through n. Therefore, we get {tilde over (Π)}e₁=e₁, and the related matrix Π will satisfy Π(ke₂)=ke₁ (where e₂ has a 1 in the second position and 0's elsewhere). And the matrix Â remaining to be factored has first column e₂.

[0160] The entry in position (1, 2) of A (call it a₁₂) becomes the entry in position (2, 2) of Â. When the matrix A′ is computed in the next step, its first column is e₁+e₂, and the second entry in row 1 comes out to be a′₁₂=a₁₂−1. We then get u₂=−1, and the first column of the matrix L is also e₁+e₂.

[0161] So we get the following when applying the matrix A in factored form ΠU₁LU to the vector ke₁: U(ke₁)=ke₁, L(ke₁)=k(e₁+e₂), U₁(k(e₁+e₂))=ke₂, and Π(ke₂)=ke₁. In the corresponding integer approximation, the result at each step of this process is an integer vector anyway, and hence is not altered by rounding. Therefore, we get φ(ke₁)=ke₁ for all integers k.

[0162] Other cases where the matrix A happens to map certain integer vectors to integer vectors will probably not be preserved exactly by this integer approximation. However, if there is a particular integer vector one is interested in preserving, one may be able to apply a preliminary integral linear transformation to move this vector to e₁ before factoring. For instance, suppose that the linear transformation A maps k1 to ke₁, where 1 is the vector with all entries equal to 1. Then we can write A as {overscore (A)}Δ, where Δ is a simple integer matrix of determinant 1 which maps 1 to e₁. Then we have {overscore (A)}e₁=e₁, so we can factor {overscore (A)} as above to get a factorization ΠU₁LUΔ of A yielding an integer approximation φ which sends k1 to ke₁.

EXAMPLE The RGB→YC_(B)C_(R) Matrix

[0163] As an example, we consider a transformation for conversion of colors presented in standard red-green-blue coordinates. Here we will consider only linear changes of coordinates, ignoring nonlinear visual effects, which are not relevant for the purposes below. A popular coordinate system used for these purposes is the YC_(B)C_(R) coordinate system, described in the International Telecommunication Union standards document ITU-R BT.601.

[0164] Coordinate systems such as YC_(B)C_(R) may be more desirable for image transmission and/or compression, because they decrease wasteful correlations between the three coordinates (brighter parts of an image will tend to have higher values for all three coordinates) and because coordinate systems in which the most important part of the signal (brightness or something like it) is separated out allow different amounts of bandwidth to be used for the different coordinates. These purposes would appear incompatible with the goal of invertibility; however, it is often desirable for a compression or transmission system to be able to operate in either a lossless mode or a lossy compressed mode, so it is not unreasonable to ask for a lossless transformation from RGB to YC_(B)C_(R).

[0165] The RGB→YC_(B)C_(R) conversion is actually a family of linear transformations; a particular member of this family is specified by giving weights a_(R), a_(G), a_(B) (positive numbers summing to 1) for the R, G, and B components. The matrix corresponding to these weights is $A = \begin{pmatrix} a_R & a_G & a_B \\ \frac{-a_R}{2-2a_B} & \frac{-a_G}{2-2a_B} & \frac{1}{2} \\ \frac{1}{2} & \frac{-a_G}{2-2a_R} & \frac{-a_B}{2-2a_R} \end{pmatrix}.$

[0166] The determinant of this matrix is not 1 but $\frac{a_G}{4(1-a_R)(1-a_B)}.$

[0167] This is not a serious problem, though, because for decorrelation purposes it does not matter if a scale factor is applied to the C_(R) and/or C_(B) output components, and the scale factors can be allowed for explicitly in differential data rates. (We do not want to rescale the Y component, for reasons given below.) We might as well use the same scale factor for both of these components. This means that the first step is to pull out a scaling matrix $S = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \beta \end{pmatrix}$

[0168] (on the left) from A, where $\beta = \frac{1}{2}\sqrt{\frac{a_G}{(1-a_R)(1-a_B)}},$

[0169] leaving a matrix S⁻¹A of determinant 1 to factor.

[0170] The Y output component represents the total luminance (perceived brightness) of the specified color. In particular, if the input color is a greyscale value with all three components equal to the same number k, then the Y component of the output will be k. (This is why Y should not be rescaled.) The other two components are orthogonal to the black-white axis; they come out to be zero for greyscale input. In other words, we are in the situation described at the end of the previous section: for any k, we have A(k1)=ke₁ (and hence (S⁻¹A)(k1)=ke₁, since S fixes e₁). To ensure that the integer approximation map preserves this property, we start by pulling out a factor Δ on the right such that Δ has determinant 1 and sends 1 to e₁. There are many such matrices to choose from; one simple one is $\Delta = \begin{pmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix}.$

[0171] We are now left with a matrix {overscore (A)}=S⁻¹AΔ⁻¹ to which the algorithms from the section entitled “Larger Matrices” can be applied. These yield the factorization

S⁻¹A=ΠU₁LUΔ

[0172] where $\Pi = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \quad U_1 = \begin{pmatrix} 1 & -1 & t_1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad L = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & t_2 & 1 \end{pmatrix}, \quad U = \begin{pmatrix} 1 & a_G-1 & t_3 \\ 0 & 1 & t_4 \\ 0 & 0 & 1 \end{pmatrix},$

[0173] and the entries t_(i) are given by: $t_1 = \frac{1-a_B}{1-a_R} - \frac{(2-2a_B)\beta}{a_G}, \quad t_2 = -\frac{a_G}{(2-2a_B)\beta}, \quad t_3 = a_B + 1 - \frac{(2-2a_B)\beta - a_R}{a_G}, \quad t_4 = \frac{(2-2a_B)\beta - a_R}{a_G} - 1.$

[0174] A special case of interest is presented by the set of values provided in the ITU-R BT.601 standard, which are as follows:

[0175] a_(R)=0.299, a_(G)=0.587, a_(B)=0.114.

[0176] In this case, the numerical values of the non-integer entries in the above matrices are:

[0177] β≈0.4860860807044331, a_(G)−1=−0.413

[0178] t₁≈−0.2034584787387865, t₃≈0.1560024957269925

[0179] t₂≈−0.6814926851476152, t₄≈−0.0420024957269925
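These values, and the factorization itself, are easy to confirm numerically. The following sketch (Python with numpy; the variable names are ours) builds A, S, Δ and the four factors from the formulas above and checks that S⁻¹A=ΠU₁LUΔ for the BT.601 weights:

```python
import numpy as np

aR, aG, aB = 0.299, 0.587, 0.114
beta = 0.5 * np.sqrt(aG / ((1 - aR) * (1 - aB)))

A = np.array([[aR, aG, aB],
              [-aR / (2 - 2 * aB), -aG / (2 - 2 * aB), 0.5],
              [0.5, -aG / (2 - 2 * aR), -aB / (2 - 2 * aR)]])
S = np.diag([1.0, beta, beta])

t1 = (1 - aB) / (1 - aR) - (2 - 2 * aB) * beta / aG
t2 = -aG / ((2 - 2 * aB) * beta)
t3 = aB + 1 - ((2 - 2 * aB) * beta - aR) / aG
t4 = ((2 - 2 * aB) * beta - aR) / aG - 1

Pi    = np.array([[0., 1, 0], [0, 0, 1], [1, 0, 0]])
U1    = np.array([[1., -1, t1], [0, 1, 0], [0, 0, 1]])
L     = np.array([[1., 0, 0], [1, 1, 0], [0, t2, 1]])
U     = np.array([[1., aG - 1, t3], [0, 1, t4], [0, 0, 1]])
Delta = np.array([[1., 0, 0], [-1, 1, 0], [-1, 0, 1]])

assert np.allclose(np.linalg.inv(S) @ A, Pi @ U1 @ L @ U @ Delta)
print(beta, t1, t2, t3, t4)   # reproduces the numerical values above
```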

[0180] We can now apply the error analysis methods presented earlier to obtain error bounds for this transformation. Note that the integer isometry Π has no effect on the errors and can be ignored. The integer approximations to matrices U₁ and L only involve one rounding, because these matrices have only one non-integer row each; so the error bound for each of these matrices is ½, while the error bound for U (which has two non-integer rows) is $\sqrt{2}/2$. The error bound for the integer matrix Δ is 0. After computing the norms

∥U₁∥≈1.6328964907799118,

∥U₁L∥≈1.6965471828414659, and

∥U₁LU∥≈1.4371952820317682,

[0181] we can apply (1.2) to get a forward error bound of 2.5160882629800899. Since the inverse to the integer approximation is computed differently, we cannot bound its error by applying (1.2) directly; instead we compute ∥(S⁻¹A)⁻¹∥≈1.8003448245653902 and apply (1.3) to get an inverse error bound of 4.5298264824059275.
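These norms and bounds can be reproduced with numpy (continuing the previous sketch; we take the spectral norm, i.e. the largest singular value, use the per-factor rounding errors ½, ½, $\sqrt{2}/2$ stated above, and read (1.3) as the product of ∥(S⁻¹A)⁻¹∥ with the forward bound, which matches the figures given):

```python
import numpy as np
# reuses A, S, Pi, U1, L, U from the previous sketch
norm = lambda M: np.linalg.norm(M, 2)        # spectral norm
C_U1, C_L, C_U = 0.5, 0.5, np.sqrt(2) / 2    # rounding error per factor
forward = C_U1 + norm(U1) * C_L + norm(U1 @ L) * C_U
inverse = norm(np.linalg.inv(np.linalg.inv(S) @ A)) * forward
print(forward)   # ~2.5160882629800899
print(inverse)   # ~4.5298264824059275
```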

[0182] One gets better bounds by keeping track of the errors from the separate roundings as discussed in the section entitled “The 2×2 Diagonal Matrix”. Under the worst-case assumption that these errors are independent, one gets error bounds of 1.5472559440649816 for the forward transform and 1.7941398552787594 for the inverse transform.

[0183] One can get lower bounds on the error (and thus gauge the accuracy of the preceding upper bounds) by testing the approximation on a large collection of sample inputs. One can reduce the computation needed here by using the fact that the factorization preserves the mapping k1↦ke₁. If one applies the approximation to x and to x+k1, then at every step the two results will be the same except that the second result will have k added to some coordinates; in particular, the rounding errors will be exactly the same. Similarly, the inverse to the approximation will give exactly the same errors for input vectors y and y+ke₁.

[0184] For the forward transform, a search through all input vectors (x₁, x₂, x₃)∈Z³ with |x₂−x₁|<33000 and |x₃−x₁|<33000 (only these relative differences matter) yielded a largest error of 1.5404029289484810 at the input vector (19352, 0, 20840). For the inverse transform, a search through all input vectors (x₁, x₂, x₃)∈Z³ with |x₂|<33000 and |x₃|<33000 (the value of x₁ is irrelevant) yielded a largest error of 1.7905956082490824 at the input vector (8360, 31316, 8995). These examples show that the upper bounds given above are either sharp or very close to it.

[0185] There were a number of choices made in the construction of this factorization (the form ΠU₁LU, the particular matrices S and Δ, and so on). Different choices would lead to alternative factorizations, some of which might have better error bounds than the factorization presented here.

More General Bijections

[0186] As mentioned in the section entitled “Introduction”, the approximation problem in the fixed-length case is equivalent to finding a bijection ψ from a transformed lattice AZ^(n) to the standard integer lattice Z^(n) which moves points as small a distance as possible (so that the integer mapping φ=ψ·A is a bijection approximating A). One can imagine many ways of trying to find such a bijection ψ; the problem seems to be a combinatorial one.

[0187] Given a matrix A of determinant ±1, we know by now that such maps ψ do exist so that the errors (the distances that points are moved) are bounded. Each such ψ has a supremal error sup_(x∈AZ^(n)) ∥ψx−x∥ (we cannot say “maximal error,” because it may be that there is no single point x for which ∥ψx−x∥ is maximal). It is natural to ask whether there is a bijection ψ which is optimal in the sense that its supremal error is as small as possible; it is conceivable that there would be no optimal bijection, because the ultimate error bound could be approached but not attained. This turns out not to be the case:

[0188] Proposition 6.1.

[0189] For any real n×n matrix A of determinant ±1, there is a bijection ψ:AZ^(n)→Z^(n) which is optimal in the sense that sup_(x∈AZ^(n)) ∥ψx−x∥ is minimal over all such bijections.

[0190] Proof.

[0191] First find an integer approximation ψ₁ with bounded error, and let ε₁ be an error bound for ψ₁. Then we only need to search among bijections with error bounded by ε₁ to find an optimal one ψ. If ε₁ is fixed, then there are only finitely many possibilities for ψx for any given x∈AZ^(n) (i.e., only finitely many standard lattice points within distance ε₁ of x) and only finitely many possibilities for ψ⁻¹y for any given y∈Z^(n).

[0192] Now a standard compactness argument can be used to complete the proof. There are several ways to express this argument. One is to note that the space of integer approximations ψ satisfying the error bound ε₁ can be given a metric (let x₁, x₂, . . . list the vectors in AZ^(n)∪Z^(n), and define the distance between distinct approximations ψ and ψ′ to be 1/k where k is least so that ψx_(k)≠ψ′x_(k) or ψ⁻¹x_(k)≠ψ′⁻¹x_(k)) so that it becomes a compact space, and the supremal error is a lower semicontinuous function from this space to the real numbers, so it must attain a minimum value. Another is as follows: Let ε be the infimum of the supremal errors of approximations ψ. For any finite sets S⊂AZ^(n) and S′⊂Z^(n), there are only finitely many ways to partially define an integer approximation ψ on S and S′ (i.e., define ψx for x∈S and ψ⁻¹y for y∈S′) so as to meet the error bound ε₁; so there must be one whose partial error bound on this finite configuration is as small as possible. Since one can find complete integer approximations with supremal errors as close as desired to ε, the partial error bound must be at most ε. So, for any finite parts of the domain and range lattices, we can define ψ and ψ⁻¹ on these parts so as to attain the error bound ε; and there are only finitely many ways to do so. Now we can apply König's infinity lemma to put these together to obtain a complete integer approximation attaining the bound ε, which is therefore optimal.

[0193] In general, the optimal lattice bijection is not unique. Also, this proof is quite non-constructive, and it is not clear that the optimal bijection(s) will be implementable or describable in any useful way.

[0194] Let us now examine the case where the matrix A has rational entries. Then the transformed lattice AZ^(n) will contain many points that are also in the standard lattice Z^(n); in fact, the intersection L=AZ^(n)∩Z^(n) is a full n-dimensional lattice. (To see this, it is enough to get n independent vectors in L; one can do this by taking the n independent columns of A and multiplying each by its least common denominator to get an integer vector.) This means that the configuration of points in the two lattices is periodic: the configuration at x looks just like the configuration at x+a for any a∈L.

[0195] Now L is a subgroup of Z^(n) of finite index (which can be computed by forming a matrix whose columns are n generating vectors for L and taking the absolute value of its determinant), and is a subgroup of AZ^(n) with this same index (because the determinant of A is ±1). So one can pair off the L-cosets in AZ^(n) with the L-cosets in Z^(n). Any two cosets of L are translates of one another, and such a translation gives a bijection between the cosets. If we take such a translation from each L-coset in AZ^(n) to the corresponding L-coset in Z^(n), we get a bijection ψ from AZ^(n) to Z^(n) which is of bounded error; in fact, the maximum error is the largest of the norms of the translation vectors used.

[0196] Just like the lattice configuration, the action of the mapping ψ looks the same near x as near x+a for any a∈L. In fact, the displacement ψx−x is a periodic function of x. We will refer to such a bijection ψ as a ‘periodic’ bijection, and to the corresponding integer approximation φ as a ‘periodic’ approximation.

[0197] There are only finitely many ways to pair off the L-cosets of the two lattices; and for each pair of cosets there are only finitely many translations from one to the other with translation vector of norm below a specified bound. (Given any point in the first coset, there are only finitely many points in the second coset within the specified distance of the given point; these give the finitely many translation vectors one can try. Clearly there is a best translation vector, although it may not be unique. In fact, the pairing between cosets can be thought of as a bijection between two finite lattices on the n-dimensional torus R^(n)/L with a suitable metric.) So there are only finitely many ‘periodic’ bijections meeting any specified error bound; it follows that there must be an optimal ‘periodic’ bijection whose maximum error (in the ‘periodic’ case, a maximum error is attained) is as small as possible. It turns out that this is optimal among all bijections:

[0198] Proposition 6.2.

[0199] For any rational n×n matrix A of determinant ±1, an optimal ‘periodic’ integer approximation to A will in fact be optimal among all integer approximations.

[0200] Proof.

[0201] It suffices to show that, if there is any bijection ψ from AZ^(n) to Z^(n) meeting error bound ε, then there is a ‘periodic’ bijection meeting error bound ε.

[0202] Let m be the index of L in Z^(n) (and in AZ^(n)). Then, for any n-cube of side-length s, the number of points of any L-coset in the cube is s^(n)/m+o(s^(n)). This means that we can find a large cube B and a positive natural number N such that every L-coset contains at least N points inside B and at most N+N/m points within distance ε of B (because they would lie in a slightly larger cube of side-length s+2ε).

[0203] Now, for any k≦m, if we put together k of the m cosets of L in AZ^(n), we get at least kN points inside B. These are mapped by ψ to at least kN points within distance ε of B. These image points cannot be included in k−1 cosets of L in Z^(n), because

(k−1)(N+N/m)<kN

[0204] for k≦m. So the image points meet at least k cosets of L in Z^(n).

[0205] Therefore, by the Marriage Theorem, there is a one-to-one pairing (hence a bijection) from the source cosets to the target cosets so that, if C_(i) is paired with C′_(i), then we can find x_(i)∈C_(i) such that ψx_(i)∈C′_(i). Let a_(i)=ψx_(i)−x_(i); then ∥a_(i)∥≦ε. Using this coset pairing and the translation vectors a_(i), construct a ‘periodic’ bijection ψ′; then ψ′ meets the error bound ε, as desired.

[0206] Propositions 6.1 and 6.2 also work if one is trying to optimize the approximation error for the inverse transform, or some combination of the forward and inverse errors (e.g., the maximum of the two). Note that the inverse of a ‘periodic’ approximation is a ‘periodic’ approximation to the inverse linear transformation.

[0207] As an example, let us consider the diagonal matrix $D = \begin{pmatrix} \alpha & 0 \\ 0 & \alpha^{-1} \end{pmatrix}$

[0208] in the simple case α=2. Here the lattice L is just 2Z×Z (i.e., the set of integer pairs such that the first coordinate is even), and there are two cosets of L in each of the lattices DZ² and Z². Hence, there are only two ways to pair off the cosets; the one which gives the smaller error is the one which maps L to (1, 0)+L and (0, ½)+L to L. This yields a bijection ψ with maximum error 1. The formula for the corresponding integer approximation φ to D is: $\phi\begin{pmatrix} m \\ n \end{pmatrix} = \begin{cases} \begin{pmatrix} 2m+1 \\ n/2 \end{pmatrix} & \text{if } n \text{ is even,} \\ \begin{pmatrix} 2m \\ (n-1)/2 \end{pmatrix} & \text{if } n \text{ is odd.} \end{cases}$

[0209] Note that a greedier algorithm for constructing the bijection might have started by mapping all the points in L to themselves (error 0); but then the points in (0, ½)+L would have to be mapped to (1, 0)+L, leading to a larger overall error of $\sqrt{5}/2$. Also, for this particular example the approximation which is optimal for the forward error also happens to be optimal for the inverse error; there is no reason to believe that this happens in general.
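This map is small enough to check exhaustively on a sample region. A minimal sketch (Python with numpy; the names are ours) verifies that φ is injective on a sample window and that its error never exceeds 1:

```python
import numpy as np

def phi(m, n):
    """The 'periodic' integer approximation to D = diag(2, 1/2) above."""
    if n % 2 == 0:
        return (2 * m + 1, n // 2)
    return (2 * m, (n - 1) // 2)

pts = [(m, n) for m in range(-20, 20) for n in range(-20, 20)]
images = {phi(m, n) for m, n in pts}
assert len(images) == len(pts)        # injective on the sample window
errs = [np.hypot(2 * m - phi(m, n)[0], n / 2 - phi(m, n)[1])
        for m, n in pts]
assert max(errs) == 1.0               # maximum error 1, as claimed
```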

[0210] For other rational matrices A, there will probably be more cosets to deal with; in this case, the implementation of a ‘periodic’ function will probably be by table lookup. To apply the approximating map φ to a given integer vector x, one will determine which coset C_(k) of the sublattice A⁻¹L contains x (which is equivalent to determining which coset of L in AZ^(n) contains Ax), find in the table a corresponding rational vector a_(k), and let φx=Ax+a_(k). Note that, for a general lattice A⁻¹L, determining which coset contains x may not be trivial. It may be more convenient to use a smaller lattice L′⊆A⁻¹L of the form L′=m₁Z×m₂Z× . . . ×m_(n)Z; this will make the table longer, but will make it much easier to determine which coset contains x. The numbers m_(j) are easily computed: m_(j) is the least common denominator of the rational numbers in column j of the matrix A.

[0211] Finding the best ‘periodic’ approximation is a finite combinatorial search problem. There is an algorithm for solving this problem in time polynomial in the number of cosets. Determining whether there is a pairing of source cosets with target cosets meeting a given error bound (and finding one if there is) is a bipartite matching problem which can be solved in polynomial time by network flow methods. The correct optimal bound will be one of the n² distances between a source coset and a target coset; using a binary search, one can find the optimal bound by solving ┌2 log₂ n┐ bipartite matching problems.
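A hedged sketch of this search (Python; the distance table and all names are our own toy construction): feasibility of a given bound is tested with a standard augmenting-path bipartite matching, and a binary search over the candidate coset-to-coset distances finds the optimal bound.

```python
def has_matching(dist, bound):
    """Can every source coset be paired with a distinct target coset
    using only translations of length <= bound? (Kuhn's algorithm.)"""
    n = len(dist)
    match = [-1] * n                  # target index -> source index
    def augment(s, seen):
        for t in range(n):
            if dist[s][t] <= bound and t not in seen:
                seen.add(t)
                if match[t] == -1 or augment(match[t], seen):
                    match[t] = s
                    return True
        return False
    return all(augment(s, set()) for s in range(n))

def optimal_bound(dist):
    """Binary-search the sorted coset-to-coset distances for the least
    bound that admits a perfect pairing."""
    cands = sorted({d for row in dist for d in row})
    lo, hi = 0, len(cands) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if has_matching(dist, cands[mid]):
            hi = mid
        else:
            lo = mid + 1
    return cands[lo]

# toy 3x3 distance table between source and target cosets
print(optimal_bound([[0.0, 1.0, 2.0],
                     [1.0, 0.5, 1.0],
                     [2.0, 1.0, 0.25]]))   # -> 0.5
```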

[0212] Of course, if the number of cosets is too large for the optimal ‘periodic’ approximation to be implemented (let alone found), then one will need to use a different approximation algorithm, even if it is suboptimal.

[0213] In order to see how sharp computed upper bounds are, or how close to optimal a given bijection might be, it is useful to obtain lower bounds on the possible supremal errors of integer approximations or lattice bijections. One way to do this (in fact, by the argument of Proposition 6.1, essentially the most general way) is to examine finite parts of the two given lattices and show that one cannot even define a partial bijection on these finite parts without incurring an error of at least ε.

[0214] One finite configuration is easy to use: if the transformed lattice AZ^(n) contains a point x which is at distance δ from the nearest point in the standard lattice Z^(n), then any bijection from AZ^(n) to Z^(n) must have error at least δ. (The same applies if some point in Z^(n) is at distance at least δ from the nearest point of AZ^(n).) In particular, if AZ^(n) contains points arbitrarily close to the centers of cubes in the standard lattice Z^(n) (this will be true if, for instance, some column of A has entries a_(1j), . . . , a_(nj) such that a_(1j), . . . , a_(nj), 1 are linearly independent over the rationals), then the supremal error must be at least $\sqrt{n}/2$.

[0215] To obtain better lower bounds, one must analyze the interactions between points in the domain lattice—if x≠x′ and ψx=y, then ψx′ cannot also be y, so it may end up being farther from x′. Such analysis is highly dependent on the particular matrix A. In the case of the 2×2 diagonal matrix D, one can substantially improve the lower bound:

[0216] Proposition 6.3.

[0217] If α>0 is given, then, for any integer bijection φ approximating the diagonal matrix D, the error sup_(x∈Z²) ∥Dx−φx∥ must be at least {overscore (ε)}(α), where: if α>1 is irrational, then $\overline{\varepsilon}(\alpha) = \sqrt{\left(\frac{1-(2k-1)\alpha^{-1}}{2}\right)^2 + k^2},$

[0218] where k=┌(α−1)/2┐; if α>1 is a rational number m/n in lowest terms, then $\overline{\varepsilon}(\alpha) = \sqrt{\left(\left\lfloor\frac{m-(2k-1)n}{2}\right\rfloor / m\right)^2 + k^2},$

[0219] where k is as above; if α=1, then {overscore (ε)}(α)=0; if α<1, then {overscore (ε)}(α)={overscore (ε)}(α⁻¹).

[0220] Proof.

[0221] The case α=1 is trivial. Any bound which works for α also works for α⁻¹ (because D(α⁻¹) is just D(α) with the two coordinates interchanged). So we may assume α>1.

[0222] Let ε be the supremal error for φ; we must show that ε≧{overscore (ε)}(α). Consider the corresponding bijection ψ=φ·D⁻¹ from DZ² to Z², and look at the points of DZ² on the y-axis. These points are spaced at a distance α⁻¹ apart, which is too crowded; some of them will have to be moved by ψ to points not on the y-axis. (This statement may appear rather vague. To make it more precise, consider a large finite number s. The number of points of DZ² on the y-axis within distance s of the origin is 2└sα┘+1. These points are sent by ψ to points of Z² within distance s+ε of the origin; since the number of such points on the y-axis is only 2└s+ε┘+1, which is smaller than 2└sα┘+1 for large s, ψ must map some of these points on the y-axis to points not on the y-axis. The statements in the remainder of this proof can be made precise in the same way, but actually doing so would make the proof far less readable, so we prefer to state the arguments more informally.)

[0223] In fact, only a fraction 1/α at most of the domain points on the y-axis can be mapped to range points on the y-axis. Similarly, for any other vertical line, ψ maps at most the fraction 1/α of the domain points on the y-axis to range points on this vertical line.

[0224] The number k was chosen so that (2k−1)/α<1. The map ψ sends at most the fraction (2k−1)/α of domain points on the y-axis to range points with x-coordinate of absolute value less than k, because these range points are on 2k−1 vertical lines. So, for the remaining fraction 1−(2k−1)/α of the points on the y-axis, the map ψ introduces an error of at least k horizontally.

[0225] If α is irrational, then the vertical distances from points on the y-axis in the domain lattice to the nearest points in the range lattice are spread out uniformly over the interval [0, ½). So, even if we choose the points of least possible vertical error to be given horizontal error at least k, the vertical errors will have to range up to at least (1−(2k−1)/α)/2, so the total errors will range up to {overscore (ε)}(α).

[0226] If α=m/n in lowest terms, then 1/m of the domain points on the y-axis will entail no vertical error (because they are already standard lattice points), 2/m of them will entail vertical error at least 1/m, 2/m will entail vertical error at least 2/m, and so on. If we again try to find the minimum possible vertical errors to combine with the large horizontal errors, we see that we are forced to use vertical errors up to and including └(m−(2k−1)n)/2┘/m, thus leading to a combined error of {overscore (ε)}(α).

[0227] In particular, for α=$\sqrt{2}$ no integer approximation can have an error bound better than $\sqrt{22-4\sqrt{2}}/4 \approx 1.0106664184619603$; so the approximation obtained from factorization (3.3) is not very far from optimal. And for any α≠1 the error bound must be at least 1.

[0228] There is no reason to expect the lower bound from Proposition 6.3 to be sharp in most cases; examination of lattice points other than those on the one line considered in that proof could show that larger errors must occur.

[0229] The proof of Proposition 6.3 applies to any matrix A having a column whose only nonzero entry has absolute value less than 1; a similar argument works to give a lower bound on the error when there is a column whose only nonzero entry has absolute value greater than 1. This can be generalized to other situations where A maps a rational subspace to a rational subspace with the “wrong” scaling.

Additional Considerations

[0230] A number of matrix factorization methods for obtaining integer bijections approximating given linear transformations on fixed-length vectors have been considered. Such bijections exist and are easy to implement, and can be made to have additional desirable properties, such as preservation of suitable integer inputs (which are preserved by the given transformation). Approximation methods that are not based on simple matrix factorizations were also considered.

[0231] There are many possibilities that remain to be explored, including additional factorizations of matrices, different integer approximations of matrix factors, more integer approximation methods having nothing to do with factorizations, and improved error analysis.

[0232] For instance, as noted earlier, unit triangular matrices can produce larger error bounds for the inverse transform than for the forward transform, because the inverse transform is computed recursively. One might be able to compute the transform in a different way, perhaps doing the recursion in the forward transform for some factors and in the inverse transform for other factors, so as to balance out the errors. Or one could try to use a different sort of factor matrix which does not have this problem. For instance, suppose we partition the coordinates or bands into two groups, and consider two kinds of factors: one where linear combinations of first-group coordinates are added to second-group coordinates, and one where linear combinations of second-group coordinates are added to first-group coordinates. Then recursion would not be needed to invert any of these factor matrices, and one may be able to get better overall error bounds. On the other hand, degree-of-freedom counting shows that such factorizations would require at least four factors in the general fixed-length case, if the length is greater than 2; and more detailed analysis shows that even four factors is not enough. It is likely that the additional factors will outweigh the benefit from eliminating recursion in the inverse.

[0233] As for the error analysis, even the simple case of a 2×2 diagonal matrix was not analyzed completely. In more complicated cases the analysis was quite selective; many variant factorizations remain to be examined. And everything was based on the initial assumption that the goal was to minimize the worst-case error in the integer approximation of the transform (and perhaps the inverse transform). Some applications may entail optimizing with respect to some other parameter, in which case different integer approximations may work better.

Unbounded Length Vectors

[0234] As discussed earlier, the second version of the problem involves analyzing input vectors for signals having unbounded length (number of coordinates), but which are of bounded amplitude (i.e., the values appearing as coordinates of the vector are bounded). Such signals are treated as bounded sequences of real numbers that are essentially infinite in both directions. In practice, however, the signals will be of finite length and boundary conditions will be needed at the ends of these sequences.

[0235] The use of infinite sequences of real numbers imposes two restrictions. The first restriction is a time-invariance condition. Strict time invariance or shift invariance would require that shifting the input signal over by one step would result in the same output signal also shifted over by one step. This is too strong, though; instead we require that the coordinates of the output signal be obtained by applying n time-invariant transformations in rotation (so shifting the input n steps results in the same output shifted by n steps). This can also be expressed as applying n time-invariant mappings or “filters,” taking only every n'th coordinate of each output signal (“downsampling”), and merging the results.

[0236] In such a case the output signal consists of n different subsignals or “bands” merged together. One can also treat the input signal as comprising n bands in the same way. So the input signal is conceptually broken up into blocks of length n; the j'th band consists of the j'th component of each block. (Sometimes the input signal is presented in n separate bands already.) So the input and output signals can be thought of as essentially infinite sequences of members of R^(n), and the linear transformation as a fully time-invariant mapping in this formulation.
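The band decomposition is just a reshaping of the signal; for instance (Python with numpy, our own toy helper):

```python
import numpy as np

def to_bands(x, n):
    """Split a signal (length a multiple of n) into n bands: band j
    collects the j'th component of each length-n block."""
    return np.asarray(x).reshape(-1, n).T

x = np.arange(12)        # toy signal, blocks of length n = 3
print(to_bands(x, 3))    # band 0: [0 3 6 9], band 1: [1 4 7 10], ...
```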

[0237] The second restriction is that a component of the output signal depends on only finitely many components of the input signal; a transformation with this property is called FIR (finite impulse response). A time-invariant (or n-fold time-invariant as above) FIR linear transformation must produce a bounded-amplitude output signal when applied to a bounded-amplitude input signal. The part of the input signal on which a given output coordinate depends (the “stencil” of the transformation) will often include more than n coordinates.

[0238] A linear transformation with these properties can be described by n×n matrices M_(k) for k∈Z, only finitely many of which are nonzero. The input signal x is a sequence of n-vectors x_(i), and the output signal y=f(x) is a sequence of n-vectors y_(j); these are related by the formula $y_j = \sum_k M_k x_{j+k} = \sum_i M_{i-j} x_i$

[0239] (the sums are over all integers, but only finitely many terms are nonzero). This can be more conveniently expressed in terms of the z-transform, which we think of here as a generating function approach. If we introduce the generating functions p(z)=Σ_(i)x_(i)z^(i) and q(z)=Σ_(j)y_(j)z^(j) for the input and output signals (these can be thought of as series of n-vectors or as n-vectors of series), and we also define the matrix A(z) to be Σ_(k)M_(k)z^(−k), then the formula above becomes simply q(z)=A(z)p(z). The z-transform matrix A(z) (commonly called the polyphase matrix of the transformation) is a matrix whose entries are Laurent polynomials over R, i.e., members of the ring R[z,z⁻¹]. If no negative powers of z occur in the matrix, then the output vector at time j depends only on the input vectors at time j and earlier times (the transformation is causal). Just as for fixed-length transformations, composition of transformations here corresponds to multiplication of the associated z-transform matrices.
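A direct prototype of the block-domain formula y_j=Σ_k M_k x_{j+k} (Python with numpy; the names and the toy example are ours, and output blocks near the boundary are simply omitted rather than given boundary conditions):

```python
import numpy as np

def apply_polyphase(M, x):
    """Apply y_j = sum_k M[k] @ x[j + k], where M maps integers k to
    n x n matrices (finitely many) and x is a list of n-vectors."""
    ks = sorted(M)
    lo, hi = ks[0], ks[-1]
    start = -lo if lo < 0 else 0
    return [sum(M[k] @ x[j + k] for k in ks)
            for j in range(start, len(x) - hi)]

# toy 2-band transformation A(z) = M_0 + M_1 z^{-1}
M = {0: np.array([[1., 1.], [0., 1.]]),
     1: np.array([[0., 0.], [0.5, 0.]])}
x = [np.array([1., 0.]), np.array([0., 1.]), np.array([2., 2.])]
print(apply_polyphase(M, x))
```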

[0240] We will assume that the given linear transformation is invertible and that the inverse transformation is also FIR (it is automatically linear and time-invariant). In this case, the original transformation is said to admit perfect reconstruction. So the inverse transformation is also given by a z-transform matrix B(z), and if p(z) and q(z) are as above, then we have p(z)=B(z)q(z)=B(z)A(z)p(z). Since this holds for all input signals, B(z) must be the inverse matrix to A(z).

[0241] We will require our integer approximation maps to be FIR and time-invariant (on n-vectors), but not necessarily linear. And we impose the same restrictions on the inverse maps.

[0242] In order to measure the error of an integer approximation, we need a norm on the space of signals; the Euclidean norm does not apply to infinite-length signals. Since we are working with bounded-amplitude signals, we could simply take the supremum of the absolute values of the components of the signal. But since we are thinking of the signal y as a sequence of vectors y_(j)∈R^(n), it is natural to define the norm ∥y∥ to be sup_(j)∥y_(j)∥. Then the error of an integer approximation φ to a given linear transformation A is just the supremum of ∥Ax−φx∥ over all input signals x. (We will abuse notation slightly by using A for a transformation and A(z) for its z-transform matrix.)

Invertible Integer-to-Integer Signal Mappings

[0243] As we discussed earlier, in the fixed-length case, a necessary condition for the existence of a bounded-error integer approximation φ to the linear transformation A is that det A=±1. We may as well assume that the determinant is 1, because, if it is −1, we can negate a row of the matrix to change the determinant to +1.

[0244] In the unbounded-length case, the linear transformation is given by a matrix A(z) over R[z,z⁻¹]. We are assuming that the inverse transformation is also given by such a matrix, which must be the inverse of A(z), so detA(z) must be an invertible element of the ring R[z,z⁻¹], i.e., a nonzero monomial cz^(k).

[0245] If we look at an integer input signal that is constant on each band, then the output signal will also be constant on each band; this essentially reduces to the case of vectors of fixed length n. The constant matrix for this fixed-length transformation is just A(1). Since an integer approximation for general signals certainly gives an integer approximation for these particular signals, the matrix A(1) must satisfy the necessary condition above, detA(1)=±1. So the monomial detA(z) must be ±z^(k) for some integer k. Again we can pull this factor out of one of the bands to reduce to the case of a transformation of determinant 1; an integer approximation for the modified matrix easily yields one for the original matrix (just shift and/or negate one band at the end).

[0246] As described earlier, the main approach will be to factor a given z-transform matrix into matrices of special form, mainly ‘permutation’ matrices (ordinary permutation matrices with some entries negated) and elementary matrices. The ‘permutation’ matrices are easy to handle, because they already map integer signals to integer signals (they just rearrange and possibly negate the bands).

[0247] An elementary matrix factor (differing from the identity only at a single off-diagonal entry) corresponds to a transformation which adds a multiple of one band (or, if the off-diagonal entry has several terms, multiples of shifted copies of one band) to another band. Factorizations into such matrices have been considered by a number of those skilled in the art; such factors, at least in the 2×2 case, are also known as liftings.

[0248] If a transformation is given by an elementary matrix which adds some modification (combination of shifts and constant multiples) of band i to band j, then we get an integer-to-integer approximation to the transformation by simply rounding the modification of band i to an integer before adding it to band j. This is easily invertible: simply subtract the same rounded modification of band i from band j. This applies more generally to matrices given by unit triangular matrices (lower or upper triangular matrices whose diagonal entries are all 1).
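A minimal sketch of this rounding construction (Python with numpy; circular shifts via np.roll stand in for boundary handling, and all names are ours):

```python
import numpy as np

def lift_forward(bands, i, j, taps):
    """Add the rounded combination sum_k taps[k] * (band i shifted by k)
    to band j; bands holds one integer band per row."""
    out = bands.copy()
    upd = sum(c * np.roll(bands[i], -k) for k, c in taps.items())
    out[j] += np.round(upd).astype(int)
    return out

def lift_inverse(bands, i, j, taps):
    """Band i was left untouched, so the identical rounded combination
    can be subtracted back off band j: exact inversion."""
    out = bands.copy()
    upd = sum(c * np.roll(bands[i], -k) for k, c in taps.items())
    out[j] -= np.round(upd).astype(int)
    return out

rng = np.random.default_rng(0)
x = rng.integers(-100, 100, size=(2, 16))
taps = {0: 0.6, 1: -0.25}        # arbitrary real multipliers and shifts
y = lift_forward(x, 0, 1, taps)
assert np.array_equal(lift_inverse(y, 0, 1, taps), x)   # lossless
```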

[0249] A number of the calculations presented earlier can be applied without change in the present context, given suitable definitions. In particular, we define the norm ∥A∥ of a signal transformation A (or the norm ∥A(z)∥ of its associated z-transform matrix) to be the supremum of ∥Ax∥/∥x∥ over all nonzero bounded inputs x (where ∥x∥ is defined as in the preceding section). Then, if A=A₁A₂ . . . A_(k) where each A_(i) can be approximated by an integer mapping φ_(i) with error bound C_(i), then A can be approximated by the composition of these integer mappings with error bound

C₁+∥A₁∥C₂+∥A₁∥∥A₂∥C₃+ . . . +∥A₁∥∥A₂∥ . . . ∥A_(k−1)∥C_(k).  (9.1)

[0250] In fact, this bound can be slightly improved to:

C₁+∥A₁∥C₂+∥A₁A₂∥C₃+ . . . +∥A₁A₂ . . . A_(k−1)∥C_(k).  (9.2)

[0251] Also, if φ approximates A, then φ⁻¹ approximates A⁻¹, because if x=φ⁻¹y, then

∥A∥⁻¹∥Ax−φx∥≦∥A⁻¹y−φ⁻¹y∥≦∥A⁻¹∥∥Ax−φx∥.  (9.3)

[0252] In this section, we will concentrate on one-dimensional signals, but the methods are also applicable to multidimensional signal transformations (i.e., to matrices whose entries are Laurent polynomials in several variables rather than the single variable z). In particular, elementary matrices are approximable by integer bijections as above even in the multidimensional case. The main difference is that it is more difficult if not impossible to factor a given multidimensional matrix of determinant 1 into elementary matrices.

Factoring a z-Transform Matrix

[0253] The Gaussian elimination method for factoring a matrix over R into elementary matrices and a diagonal matrix (and maybe a permutation matrix as well) can be extended to the case of matrices over R[z,z⁻¹]. This is the Laurent polynomial version of the algorithm for reducing a matrix polynomial to Smith normal form. The Smith normal form and a variety of methods for reducing a matrix polynomial to Smith normal form are known by those of ordinary skill in this field, where such methods involve, for instance, a Laurent polynomial case for 2×2 matrices and a case for n×n matrices.

[0254] Here we are concerned with the perfect reconstruction case, so we assume that the determinant of the given matrix A(z) is a nonzero monomial. In fact, by pulling out a diagonal matrix factor to begin with, we can reduce to the case where detA(z)=1. The entries in such a diagonal matrix represent scaling (the numerical coefficients) and delays or advances (the powers of z) for the corresponding bands.

[0255] The main part of the algorithm uses elementary row operations (each of which corresponds to pulling out an elementary matrix factor on the left). Start by selecting a column to reduce (say, the first column). If this column has more than one nonzero entry, then choose two nonzero entries, say a(z) and b(z). Suppose that the ‘degree’ of a(z) is at least as large as the ‘degree’ of b(z). (Here we define the ‘degree’ of a Laurent polynomial to be the degree of the highest-degree term minus the degree of the lowest-degree term; the ‘degree’ of 0 is −∞.) Then we can perform an elementary row operation which subtracts a suitable multiple of b(z) from a(z) so that the difference has lower ‘degree’ than a(z). (One can actually choose the multiple so that the difference has lower ‘degree’ than b(z). However, it will be useful later to not require this, even if it means that more reduction steps are needed.) Repeat this process until all but one of the entries in the selected column are 0. Since detA(z) has ‘degree’ 0, the remaining nonzero entry must be a nonzero monomial.
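One reduction step is easy to prototype with Laurent polynomials stored as exponent-to-coefficient dictionaries (Python; the representation and names are ours). Cancelling the leading term of a(z) with a monomial multiple of b(z) strictly lowers the ‘degree’ of a(z) without introducing new trailing terms:

```python
def ldeg(p):
    """'degree' of a Laurent polynomial {exponent: coefficient}:
    highest exponent minus lowest; -inf for the zero polynomial."""
    return max(p) - min(p) if p else float('-inf')

def reduce_step(a, b):
    """Subtract the monomial multiple of b that cancels the leading
    term of a (valid whenever ldeg(a) >= ldeg(b))."""
    k = max(a) - max(b)
    c = a[max(a)] / b[max(b)]
    out = dict(a)
    for e, coef in b.items():
        out[e + k] = out.get(e + k, 0.0) - c * coef
        if abs(out[e + k]) < 1e-12:
            del out[e + k]
    return out

a = {-1: 2.0, 0: 3.0, 2: 5.0}    # 2z^-1 + 3 + 5z^2, 'degree' 3
b = {0: 1.0, 1: 2.0}             # 1 + 2z, 'degree' 1
while a and ldeg(a) >= ldeg(b):
    a = reduce_step(a, b)
print(a, ldeg(a))                # 'degree' of a now below that of b
```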

[0256] Now select a second column, and do the same reduction to all of the entries in this column except the one in the row containing the nonzero entry from the first column (this row is excluded for now). So only one such entry will be nonzero, and again this entry must be a nonzero monomial. This means that, with one more row operation, we can zero out the entry in the excluded row in the second column. Do the same thing in a third column (now there are two excluded rows), and so on until all columns are processed. What remains will be a permuted diagonal matrix, with the nonzero entries being monomials with product 1.

[0257] After pulling out a permutation matrix (or a ‘permutation’ matrix of determinant 1), we are left with a diagonal matrix of determinant 1. This can be written as a product of diagonal matrices each of which has only two non-1 diagonal entries, which are reciprocals of each other. Then, if desired, one can write each of these essentially 2×2 diagonal matrices as a product of elementary matrices using the formulas discussed previously. In fact, if one really wants to, one can even write the ‘permutation’ matrix of determinant 1 as such a product as well, because such a ‘permutation’ can be written as a product of simple ‘permutations’ each of which just swaps two rows and negates one of them, and such a simple ‘permutation’ can be written as a product of three elementary matrices: $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \qquad (10.1)$

[0258] If one factors all the way down to elementary matrices in this way (leaving the ‘permutation’ matrix unfactored), then a great many factors might be required. But it turns out that unit triangular matrices are as good for our purposes as elementary matrices (simple rounding works as an integer approximation method, as discussed previously); using these, one can get by with far fewer factors, because one permuted unit triangular matrix can replace a large number (up to n(n−1)/2) of elementary matrices. To see this, suppose we are in the process of reducing a column. Say p₁(z) is the nonzero entry in this column of lowest ‘degree’; use elementary row operations to subtract multiples of p₁(z) from the other nonzero entries to get their ‘degrees’ below that of p₁(z). Let p₂(z) be the newly modified entry of least ‘degree,’ and subtract multiples of p₂(z) from the other nonzero entries (excluding p₁(z)) to reduce their ‘degrees’ below that of p₂(z). Now choose p₃(z) and reduce the other nonzero entries, excluding p₁(z) and p₂(z); and so on. All of the reduction steps described here can be combined into a single permuted unit triangular matrix.

[0259] However, there is no fixed bound (depending on n alone) for the number of factors needed here, even if these more general factors are allowed; if the entries of the matrix have very high ‘degree,’ then many factors might be required.

[0260] If one is interested in factoring a causal linear transformation (one where no negative powers of z occur in the corresponding matrix) into causal elementary factors, one can do so by following the same procedure as above, using ordinary polynomial degrees instead of ‘degrees’. This is just the ordinary reduction of a polynomial matrix to Smith normal form. In this case, if the determinant of the matrix has one or more factors z, one may not be able to remove them at the beginning; instead one follows the Smith normal form process (which is slightly more involved in this case) and ends up with a diagonal matrix in the middle of the factorization. If this diagonal matrix has determinant ±z^(k), then one can express it as a constant diagonal matrix of determinant 1 (which can be factored into elementary matrices, as discussed earlier) and a diagonal matrix with entries of the form ±z^(j) (which must be handled some other way).

Factors Which Preserve Constant Signals

[0261] As described earlier, one can consider the case where the given linear transformation already sends certain integer-valued inputs to integer-valued outputs, and we want the integer-to-integer approximating map to give the same results for these inputs.

[0262] In particular, let us consider the constant input signal with value k on all coordinates. Most filter banks are set up with one low-pass filter and one or more higher-pass filters. The higher-pass filters should have zero response to a constant signal, while the low-pass filter should give a constant nonzero response (preferably the same constant as the input). If the low-pass filter appears first in the bank, then the above properties can be expressed in terms of the z-transform matrix M(z) for the filter bank by the equation M(1)1=e₁, where 1 is the vector in R^(n) with all coordinates 1 and e₁ is the vector with first coordinate 1 and remaining coordinates 0.

[0263] We also consider the closely related family of matrices A(z) such that A(1)e₁=e₁. Such a matrix, when applied to an input consisting of a constant signal on band 1 and zero on all other bands, returns that input unchanged. (Such a matrix would commonly occur for a processing step applied to a signal after it had already been separated into low-frequency and high-frequency bands.) One can convert from a matrix M(z) satisfying M(1)1=e₁ to the form A(z) by pulling out a constant matrix factor Δ which sends 1 to e₁ as described earlier: if M(z)=A(z)Δ, then M(1)=A(1)Δ, so M(1)1=e₁ if and only if A(1)e₁=e₁.

[0264] The condition A(1)e₁=e₁ is equivalent to the statement that the leftmost column of A(1) is e₁. This means that, in the matrix A(z), all entries in the first column are divisible by z−1 except for the first entry, which has remainder 1 when divided by z−1.

[0265] Let G be the set of all matrices A(z) with entries from R[z,z⁻¹] which have determinant 1 and satisfy the equation A(1)e₁=e₁. It is easy to see that G is a group. The set of matrices M(z) of determinant 1 which satisfy M(1)1=e₁ is the right coset GΔ of G.

[0266] If we have an elementary matrix in G, then its standard integer approximation also leaves a constant integer signal in band 1 (with zeros elsewhere) unchanged. So any matrix which can be factored into elementary matrices in G has an integer approximation which preserves constant signals in band 1.

[0267] Theorem 11.1.

[0268] Any matrix in the group G can be factored into a product of elementary matrices in G.

[0269] Proof.

[0270] We perform the same reduction using elementary row operations as in the previous section, but with an extra restriction on the operations. When we have two nonzero entries a(z) and b(z) in the column we are currently working on, we wish to perform an elementary row operation which either subtracts a multiple of a(z) from b(z) so as to reduce its ‘degree,’ or subtracts a multiple of b(z) from a(z) so as to reduce its ‘degree.’ For an elementary row operation to correspond to a matrix in G, it must meet the following restriction: if it subtracts a multiple of row 1 from another row, then the multiplier must be divisible by z−1.

[0271] If neither a(z) nor b(z) is in row 1, then the restriction does not apply, and the usual reduction step is allowed. Now say a(z) is in row 1. If the ‘degree’ of a(z) is greater than or equal to that of b(z), then we can subtract a suitable multiple of b(z) from a(z); again this is always allowed. If the ‘degree’ of a(z) is less than that of b(z), then we want to subtract a multiple of a(z) from b(z) so as to eliminate at least one leading or trailing coefficient from b(z). (We are not requiring that the ‘degree’ of b(z) be reduced all the way below that of a(z); reducing it by at least one will suffice.) So in fact we could make the multiplier a monomial cz^(k) chosen so that the leading term of cz^(k)a(z) is the same as that of b(z). But the multiplier cz^(k)−cz^(k−1) would also work to eliminate the leading term of b(z), and it would not introduce new trailing terms, because the ‘degree’ of cz^(k−1)(z−1)a(z) is one more than that of a(z), and hence not more than that of b(z). So this multiplier will give a valid reduction step, while satisfying the G restriction.
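A concrete check of this multiplier trick (Python with sympy; the polynomials a(z) and b(z) are our own example, with a(z) playing the role of the row-1 entry): the multiplier cz^(k)−cz^(k−1) is divisible by z−1, as the G restriction demands, yet still cancels the leading term of b(z):

```python
from sympy import symbols, expand, degree, rem

z = symbols('z')
a = z**2 - z + 1        # row-1 entry: a(1) = 1, so a = 1 mod (z - 1)
b = 3*z**3 + z + 2      # entry in another row, of higher degree

# the naive multiplier killing b's leading term would be c*z**k = 3*z;
# c*z**k - c*z**(k-1) has the same leading effect but lies in (z-1)R[z]
m = 3*z - 3
assert rem(m, z - 1, z) == 0            # multiplier divisible by z - 1
b_new = expand(b - m * a)
print(b_new)                            # 6*z**2 - 5*z + 5
assert degree(b_new, z) < degree(b, z)  # leading coefficient eliminated
```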

[0272] Let us require that column 1 be the first column reduced in this way. After column 1 is reduced, the remaining nonzero entry in this column must be in row 1, because the matrix will still be in G. Then one can proceed to reduce the other columns as described in the previous section; these row operations do not involve row 1, so they are all allowed. And the steps for eliminating the remaining entries in excluded rows never require subtracting a multiple of row 1 from another row (since row 1 was the first row excluded), so they are allowed as well.

[0273] So we can reduce to a permuted diagonal matrix. Since the upper left entry of the matrix is still nonzero, the permutation does not move index 1. So one can perform a sequence of swap-and-negate operations not involving row 1 so as to reduce to an actual diagonal matrix; these operations can be expressed as elementary operations using (10.1), and these operations are allowed because they do not involve row 1.

[0274] We are now left with a diagonal matrix of determinant 1 whose entries are monomials in z; and the monomial in the upper left corner must have coefficient 1 in order for the matrix to be in G. This matrix can be factored into essentially 2×2 diagonal factors of the form considered earlier, where each diagonal entry is a monomial in z: one between rows 1 and 2, one between rows 2 and 3, and so on. Each of these factors can be broken down into elementary matrices using the formula $\begin{pmatrix} \alpha & 0 \\ 0 & \alpha^{-1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ \alpha-1 & 1 \end{pmatrix}\begin{pmatrix} 1 & -\alpha^{-1} \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ -(\alpha-1)\alpha & 1 \end{pmatrix}$

[0275] (this is (2.1) with r=1 and s=α−1). For the first factor, α is of the form z^(k), so α−1 and (α−1)α are divisible by z−1; thus, the elementary matrices here are in G. For the remaining factors the restriction does not apply. This completes the factorization into elementary matrices in G.

[0276] As in the preceding section, one can get by with far fewer factors by using unit triangular matrices rather than elementary matrices in the n×n case.

[0277] Again, if one has a causal transformation of determinant 1 and wants causal elementary factors, one can get them by the same procedure, using ordinary polynomial degrees instead of ‘degrees’ and always trying to eliminate leading coefficients rather than “leading or trailing” coefficients. (The entries in the final diagonal matrix will be constants.) So any causal matrix in G can be factored into causal elementary matrices in G.

[0278] For a causal transformation whose determinant has z factors, one can first check whether the first column of the matrix is divisible by a power of z; if so, this power can be pulled out as a diagonal matrix on the right (which just shifts band 1, and hence preserves a constant signal in this band). Once this is done, the first column can be reduced as usual, and then the Smith normal form process can be applied to the lower right (n−1)×(n−1) submatrix. Then a diagonal matrix can be pulled out on the left (in two parts, as at the end of the preceding section), and the reduction of row 1 using the remaining rows (which now look like the identity matrix) can be completed.

[0279] In the proof of Theorem 11.1, the first row and column of the matrix must be handled specially, but there is no restriction on the remaining rows and columns; they can be reduced by any Euclidean algorithm steps desired. This extra freedom can be used to obtain additional properties of the factorization, if desired.

[0280] For instance, suppose k<n is fixed, and we are given an n×n matrix A(z) over R[z,z⁻¹] of determinant 1 with the property that A(1)e_(i)=e_(i) for all i≦k, where e_(i) is a vector with 1 in entry i and 0 elsewhere. In other words, the transformation given by A(z) preserves a single-band constant signal in any of the first k bands. Then A(z) can be factored into elementary matrices which also preserve constant signals in these bands. To see this, first perform the reduction on the first column as in the proof of Theorem 11.1, where now the first k rows are restricted (any elementary operation with one of these rows as the source must use a multiplier divisible by z−1). We can continue this until no more legal reductions are possible; at this point all of the unrestricted rows will have been zeroed out. Since the determinant of the matrix is 1, the remaining nonzero entries in the column must have greatest common divisor 1, so we can obtain the number 1 as a sum of multiples (by elements of R[z,z⁻¹]) of these entries. Using the same multipliers with an additional factor of z−1, we can perform legal elementary operations so as to make the entry in row n of this column equal to z−1. Now, since the first entry in this column is 1 plus a multiple of z−1 (this was true at the start, and all operations preserved it), we can perform one more elementary operation from row n to row 1 to change this first entry to 1. Now legal elementary operations from row 1 can be used to zero out all of the other entries (which are multiples of z−1). Next we proceed to the second column and do the same thing in rows 2 through n; then we can easily eliminate entry 1 in this column using the 1 in entry 2. Proceed this way through the first k columns, and then use the unrestricted algorithm to handle the rest.

Small-Stencil Factorizations

[0281] When we factor a z-transform matrix into elementary matrices, we are decomposing the corresponding transformation into steps which allow only a very specific form of interaction between parts of the signal. However, this form of interaction can still be very wide-ranging, because arbitrary powers of z are allowed in the multipliers occurring in the elementary factors. One may want to restrict the factors further so as to require the interactions to be fairly local.

[0282] Let us consider the case of 2×2 matrices first. This is the case where the signal (long sequence of numerical values) is broken up into two-entry blocks. An elementary matrix factor with nonzero entry in the upper right corner will modify this signal by leaving the second entries in the blocks alone, but adding some linear combination of the second entries to the first entries. If the nonzero entry is in the lower left corner, then a linear combination of the first entries will be added to the second entries.

[0283] A natural locality restriction would be to require that the number added to a second entry be computed from the two neighboring first entries (the one in the same block and the one in the next block), and the number added to a first entry be computed from the two neighboring second entries (the one in the same block and the one in the previous block). This means that we allow only elementary matrix factors of the forms $\begin{pmatrix}1 & rz+s \\ 0 & 1\end{pmatrix}\quad\text{and}\quad\begin{pmatrix}1 & 0 \\ rz^{-1}+s & 1\end{pmatrix},$

[0284] where r and s are constants.
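To make this block convention concrete, here is a minimal Python sketch of how these two factor shapes act on a signal split into two bands. The function names, sample data, and periodic boundary handling are illustrative assumptions, not part of the method as described:

```python
# Illustrative sketch: action of the two allowed small-stencil factor
# shapes on a two-band signal.  Assumptions (not from the text): numpy,
# periodic boundaries, z meaning "previous block" and z**-1 "next block".
import numpy as np

def apply_upper(x1, x2, r, s):
    # (1, r*z+s; 0, 1): each first entry gets r*(second entry of the
    # previous block) + s*(second entry of the same block).
    return x1 + r * np.roll(x2, 1) + s * x2, x2

def apply_lower(x1, x2, r, s):
    # (1, 0; r*z**-1+s, 1): each second entry gets r*(first entry of the
    # next block) + s*(first entry of the same block).
    return x1, x2 + r * np.roll(x1, -1) + s * x1

x1 = np.array([1.0, 2.0, 3.0, 4.0])   # first entries of the blocks
x2 = np.array([5.0, 6.0, 7.0, 8.0])   # second entries of the blocks
x1, x2 = apply_upper(x1, x2, r=0.5, s=-0.25)
x1, x2 = apply_lower(x1, x2, r=0.1, s=0.3)
```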

[0285] For n×n matrices where n>2, it is less obvious what the exact meaning of “small-stencil” or “local” elementary matrix should be. One could allow only nearest-neighbor interactions as in the 2×2 case, but this would be extremely restrictive; it would allow only elementary matrices where the nonzero off-diagonal entry is a constant adjacent to the diagonal, a monomial rz in the upper right corner, or a monomial rz⁻¹ in the lower left corner. It would be more flexible to allow interactions between the i'th entry in a block and the two closest j'th entries, one on each side. This would allow the nonzero off-diagonal entry of the elementary matrix to occur anywhere, but: if it is above the diagonal, it must be of the form rz+s; if it is below the diagonal, it must be of the form rz⁻¹+s. (Or one may want a different definition here if one is trying to meet particular implementation restrictions.)

[0286] It turns out that, even with the restrictive nearest-neighbor definition, it is always possible to factor a z-transform matrix of determinant 1 into small-stencil elementary factors. Since we already know how to factor such a matrix into unrestricted elementary factors, we just need to express a given elementary matrix as a product of small-stencil elementary matrices. Next note that, because of equations such as $\begin{pmatrix}1 & a+b \\ 0 & 1\end{pmatrix}=\begin{pmatrix}1 & a \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & b \\ 0 & 1\end{pmatrix},\qquad(12.1)$

[0287] we may assume that the nonzero off-diagonal entry in the given elementary matrix is a monomial.

[0288] In terms of the unblocked signal, this transformation adds c times entry number i+kn to entry number j+kn for all integers k, where c, i, and j are given constants (and j−i is not divisible by n). If this is not already a nearest-neighbor interaction (i.e., |j−i|>1), then it can be changed into one by using nearest-neighbor swaps to move the interacting entries closer to each other. For instance, if j−i>1, then we can swap entry j−1+kn with entry j+kn for all integers k. This will not move the entries in positions i+kn unless j−1−i is divisible by n, in which case these entries are moved one place to the right. So the interacting entries will end up one or two places closer to each other. Repeat this until the interacting entries are adjacent, do the operation which performs the interaction, and then reverse all the swaps. This factors the nonlocal operation into a sequence of local operations (including swaps).

[0289] We do not want to use literal swaps, though, since they have determinant −1 as linear operations. Instead, we negate one of the two entries being swapped; this ‘swap’ or swap-and-negate is a 90-degree rotation between two bands.

[0290] Returning to the z-transform matrices, this means that we can factor our non-local monomial elementary matrix into a local monomial elementary matrix and a number of local ‘swaps.’ A local ‘swap’ which does not cross block boundaries looks like an identity matrix except that some 2×2 block centered on the diagonal is changed to $\begin{pmatrix}0 & 1 \\ -1 & 0\end{pmatrix}\quad\text{or}\quad\begin{pmatrix}0 & -1 \\ 1 & 0\end{pmatrix}.$

[0291] If the ‘swap’ does cross a block boundary, then it is an identity matrix with the four corner entries changed to $\begin{pmatrix}0 & z \\ -z^{-1} & 0\end{pmatrix}\quad\text{or}\quad\begin{pmatrix}0 & -z \\ z^{-1} & 0\end{pmatrix}.$

[0292] Here is an example with n=3: $\begin{pmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 7z & 0 & 1\end{pmatrix}=\begin{pmatrix}1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0\end{pmatrix}\begin{pmatrix}0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}0 & 0 & z \\ 0 & 1 & 0 \\ -z^{-1} & 0 & 0\end{pmatrix}\times\begin{pmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -7 & 1\end{pmatrix}\begin{pmatrix}0 & 0 & -z \\ 0 & 1 & 0 \\ z^{-1} & 0 & 0\end{pmatrix}\begin{pmatrix}0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0\end{pmatrix}$

[0293] It now remains to note that each local ‘swap’ can be factored into three local elementary matrices. For the case where the ‘swap’ does not cross a block boundary we just use (10.1). If the ‘swap’ does cross a block boundary we use a very similar formula: $\begin{pmatrix}0 & -z \\ z^{-1} & 0\end{pmatrix}=\begin{pmatrix}1 & 0 \\ z^{-1} & 1\end{pmatrix}\begin{pmatrix}1 & -z \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 \\ z^{-1} & 1\end{pmatrix}.$
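Since these are exact matrix identities in z, they can be spot-checked numerically by evaluating every factor at a sample value of z. The following Python snippet is an illustrative check (not part of the specification) of the n=3 example of paragraph [0292] and the boundary-crossing ‘swap’ factorization of paragraph [0293]:

```python
# Illustrative numerical check of the two identities above, evaluated
# at an arbitrary nonzero sample point z.
import numpy as np

z = 0.7 + 0.2j

lhs = np.array([[1, 0, 0], [0, 1, 0], [7 * z, 0, 1]])
factors = [np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]]),
           np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]]),
           np.array([[0, 0, z], [0, 1, 0], [-1 / z, 0, 0]]),
           np.array([[1, 0, 0], [0, 1, 0], [0, -7, 1]]),
           np.array([[0, 0, -z], [0, 1, 0], [1 / z, 0, 0]]),
           np.array([[0, 1, 0], [-1, 0, 0], [0, 0, 1]]),
           np.array([[1, 0, 0], [0, 0, 1], [0, -1, 0]])]
assert np.allclose(lhs, np.linalg.multi_dot(factors))

# Boundary-crossing 'swap' as three elementary factors:
swap = np.array([[0, -z], [1 / z, 0]])
prod = (np.array([[1, 0], [1 / z, 1]]) @
        np.array([[1, -z], [0, 1]]) @
        np.array([[1, 0], [1 / z, 1]]))
assert np.allclose(swap, prod)
```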

[0294] In summary, we have:

[0295] Proposition 12.1.

[0296] Any matrix over R[z, z⁻¹] of determinant 1 can be factored into small-stencil elementary matrices.

[0297] This holds under either definition of “small-stencil.” Note that a large number of factors may be required.

[0298] If a given z-transform matrix has determinant a monomial other than 1, then of course it cannot be factored into small-stencil elementary factors, because it cannot be factored into elementary factors at all. But if we allow a simple one-step shift and/or negation in one band (i.e., the identity matrix with one diagonal entry changed to ±z^(±1) or −1) to be considered “small-stencil,” then a factorization into small-stencil factors can be achieved. To see this, recall from previous sections that one can factor the given matrix into elementary matrices and diagonal matrices with diagonal entries of the form ±z^(k); the elementary parts are handled as above, and the diagonal parts are taken care of by these new factors. Similar remarks apply in the next two sections.

Simultaneous Small-Stencil and Constant-Preserving Factors

[0299] In the preceding two sections we considered two extra properties that can be achieved in a factorization of a suitable z-transform matrix. Is it possible to achieve both of these properties at the same time?

[0300] First consider the more flexible definition of “small-stencil” from the previous section; we will see that suitable factorizations do exist in this case. Suppose we are given a matrix in the group G. We can factor the given matrix into elementary matrices in G using the methods discussed in the section entitled “Factors Which Preserve Constant Signals”; some of these have the nonzero off-diagonal entry in the first column, and others do not. For the ones which do not, the off-diagonal entry is unrestricted; we may assume that the off-diagonal entry is a monomial because of (12.1). This matrix can now be factored into a local elementary matrix and some local ‘swaps’ using the method described in the previous section.

[0301] For the elementary matrices with off-diagonal entry in the first column, we are not allowed to reduce to the case of monomials; instead, using (12.1), we can reduce to the case where the off-diagonal entry has the form c(z^(k)−z^(k−1)) for some real constant c and some integer k. This means that c times an entry in the source band is added to an entry in the destination band and subtracted from the next entry in the destination band. By performing a suitable sequence of ‘swaps,’ one can arrange for each source entry to lie in between the two destination entries it will affect. Then the desired operation will be small-stencil under the flexible definition. Afterwards the ‘swaps’ can be reversed to restore the bands to their original positions.

[0302] The factors here are not in the group G; instead of leaving the constant signal alone on the first band, they move it around from band to band. But the elementary operations specified above do leave the constant signal unchanged on whatever band it is currently in when the operations are applied. As for the ‘swaps,’ when one of these is factored into three elementary steps, the constant signal may appear on two bands simultaneously, but it will be restored to a single band (although perhaps negated) by the time the ‘swap’ is complete. So the corresponding integer approximation maps will always leave this integer constant signal unaltered (somewhere), and when all of the factors have been performed the constant signal will end up where it started, unchanged by the integer approximation map.

[0303] Now suppose we want to use the restrictive nearest-neighbor definition of “small-stencil.” Here we assume n≧3, because the case n=2 is already handled above. The same procedure described above works here, except that the elementary operation adding the band containing the constant signal to another band (multiplied by c in one direction and by −c in the other direction) is no longer allowed and must be decomposed further.

[0304] Suppose that the band currently containing the constant signal is band i, and we want to add it to band j: for each entry x in band i, we are to add cx to the nearest entry to the right in band j and subtract cx from the nearest entry to the left in band j. Let j′ be a band adjacent to j which is not band i. Now perform the following procedure:

[0305] subtract c times band j′ from band j;

[0306] move band i up to band j′−1;

[0307] add band j′−1 to band j′;

[0308] move band j′−1 down to band j′+1;

[0309] subtract band j′+1 from band j′;

[0310] move band j′+1 up to band i;

[0311] add c times band j′ to band i;

[0312] move band i up to band j′−1;

[0313] subtract band j′−1 from band j′;

[0314] move band j′−1 down to band j′+1;

[0315] add band j′+1 to band j′;

[0316] move band j′+1 up to band i.

[0317] Each “add” or “subtract” here is a nearest-neighbor elementary operation. “Move band i up to band j′−1” means that, if band i is not already immediately below band j′ (if it is, do nothing), then ‘swap’ band i with band i+1 (the band moving from i+1 to i is the one which is negated), then ‘swap’ band i+1 with band i+2, and so on, wrapping around from n to 1 if necessary, until the band being moved reaches j′−1 (or n if j′=1). The other “move” steps are interpreted similarly. Each of these ‘swaps’ is factored into three nearest-neighbor elementary operations as usual.

[0318] One can check that the net effect of this procedure is as desired: for each entry x in band i, the procedure adds cx to the nearest entry to the right in band j and subtracts cx from the nearest entry to the left in band j. When it is applied to input containing a constant signal in band i and nothing elsewhere, the “subtract c times” operation has no effect, the next five steps add the constant signal to band j′ and then subtract it from band j′ for no net effect, the “add c times” step does nothing because there is nothing currently in band j′, and the last five steps again subtract and add the same signal from band j′ for no net effect. So this procedure preserves the constant signal in band i.

[0319] Thus, even under the strictest definition of “small-stencil,” one can find a factorization of a given matrix in G into elementary factors so that the resulting integer approximation map φ preserves a constant signal in band 1.

[0320] But suppose one does not want the constant signal to roam from one band to another in this way. Is it still possible to achieve a small-stencil and constant-preserving factorization? In other words, can every matrix in the group G be factored into small-stencil elementary factors which are also in the group G?

[0321] The answer to this also turns out to be yes, if the flexible definition of “small-stencil” is used. Let us first consider the 2×2 case. We can factor the given matrix into elementary matrices in G as before. Again as before, we can reduce to the case where the nonzero off-diagonal entry of an elementary matrix is a monomial if it is not in column 1, and is of the form c(z^(k)−z^(k−1)) if it is in column 1.

[0322] If an elementary matrix has as its nonzero entry a monomial at the upper right, we can handle it using the factorization $\begin{pmatrix}1 & cz^{2k+i} \\ 0 & 1\end{pmatrix}=\begin{pmatrix}z & 0 \\ 0 & z^{-1}\end{pmatrix}^{k}\begin{pmatrix}1 & cz^{i} \\ 0 & 1\end{pmatrix}\begin{pmatrix}z & 0 \\ 0 & z^{-1}\end{pmatrix}^{-k},$

[0323] where k is an integer and i is 0 or 1. The elementary matrix appearing on the right here is small-stencil, and the z-shift diagonal matrix has the following small-stencil factorization in G: $\begin{pmatrix}z & 0 \\ 0 & z^{-1}\end{pmatrix}=\begin{pmatrix}1 & 0 \\ 1 & 1\end{pmatrix}\begin{pmatrix}1 & z^{-1}-1 \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 \\ -\frac{1}{2}z & 1\end{pmatrix}\begin{pmatrix}1 & 2-2z^{-1} \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 \\ \frac{1}{2} & 1\end{pmatrix}\begin{pmatrix}1 & z^{-1}-1 \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 \\ -z & 1\end{pmatrix}$
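This seven-factor identity can likewise be confirmed by evaluating both sides at a sample point; the sketch below is an illustrative numerical check only:

```python
# Illustrative check of the seven-factor identity for diag(z, 1/z),
# evaluated at a sample point z.
import numpy as np

z = 1.3 - 0.4j
factors = [np.array([[1, 0], [1, 1]]),
           np.array([[1, 1 / z - 1], [0, 1]]),
           np.array([[1, 0], [-z / 2, 1]]),
           np.array([[1, 2 - 2 / z], [0, 1]]),
           np.array([[1, 0], [0.5, 1]]),
           np.array([[1, 1 / z - 1], [0, 1]]),
           np.array([[1, 0], [-z, 1]])]
assert np.allclose(np.diag([z, 1 / z]), np.linalg.multi_dot(factors))
```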

[0324] The other elementary matrix, with a binomial at the lower left, is handled by the factorization $\begin{pmatrix}1 & 0 \\ c\left(z^{2k+i}-z^{2k+i-1}\right) & 1\end{pmatrix}=\begin{pmatrix}z & 0 \\ 0 & z^{-1}\end{pmatrix}^{-k}\begin{pmatrix}1 & 0 \\ cz^{i}\left(1-z^{-1}\right) & 1\end{pmatrix}\begin{pmatrix}z & 0 \\ 0 & z^{-1}\end{pmatrix}^{k}.$

[0325] The elementary matrix on the right here is small-stencil if i=0. If i=1, we need to factor it further: $\begin{pmatrix}1 & 0 \\ c(z-1) & 1\end{pmatrix}=\begin{pmatrix}1 & c^{-1} \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 \\ cz^{-1}-c & 1\end{pmatrix}\begin{pmatrix}1 & -\frac{1}{2}c^{-1}z \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 \\ 2c-2cz^{-1} & 1\end{pmatrix}\begin{pmatrix}1 & -\frac{1}{2}c^{-1} \\ 0 & 1\end{pmatrix}$

[0326] This completes the factorization into small-stencil elementary matrices in G.

[0327] For the n×n case, first factor the matrix into elementary matrices in G as in the previous section entitled “Factors Which Preserve Constant Signals.” Each of these elementary matrices only affects two of the n bands, so it can be factored into local elementary matrices by the methods for the 2×2 case above; under the more flexible definition, these factors are small-stencil.

[0328] If the strict nearest-neighbor definition of “small-stencil” is used, then there are n×n matrices in G for n≧3 which cannot be factored into small-stencil elementary factors in G. In fact, a small-stencil elementary factor in G cannot have its nonzero off-diagonal entry in the first column, so only matrices with leftmost column e₁ can be products of such matrices.

[0329] So we have:

[0330] Proposition 13.1.

[0331] Under either definition of “small-stencil,” any n×n matrix A in G can be factored into small-stencil elementary matrices so that the corresponding integer approximation preserves constant signals in band 1. Furthermore, any matrix in G can be factored into small-stencil elementary matrices in G under the flexible definition of “small-stencil,” but (if n≧3) not under the strict definition.

[0332] The results in this section and the preceding one seem to indicate that requiring factors to be small-stencil substantially increases the size of the factorization. However, this is normally true only when one is factoring unusual matrices with long-scale but no short-scale interactions. For more typical matrices consisting of Laurent polynomials with no gaps in their coefficients, it is common to obtain the small-stencil property with no additional effort during the factorization process, or with only a small amount of care when one has a choice to make. An example of obtaining the small-stencil property with no added effort is shown in the upcoming section entitled “Example: The 9-7 Wavelet.”

Causality and Small-Stencil Factorizations

[0333] We noted earlier that the algorithms in the sections entitled “Factoring A z-Transform Matrix” and “Factors Which Preserve Constant Signals” need only slight modification so that, when applied to a causal matrix (one where no negative powers of z occur), they yield causal factor matrices. However, the algorithms in the sections entitled “Small-Stencil Factorizations” and “Simultaneous Small-Stencil And Constant-Preserving Factors” involve moving bands back and forth, thus introducing non-causal factors even when the original matrix is causal. If one wants a factorization into small-stencil elementary matrices which are also causal, then one will need modified methods, at least.

[0334] For an elementary matrix to be both causal and small-stencil, its nonzero off-diagonal entry must be of the form rz+s. If the flexible definition of “small-stencil” is used, then the z-coefficient r is allowed to be nonzero only for entries above the diagonal. The strict definition of small-stencil imposes stronger restrictions: the off-diagonal entry must be a constant adjacent to the diagonal or a monomial rz in the upper right corner (in the 2×2 case, a binomial rz+s is allowed in the upper right corner).

[0335] It turns out that, in the 2×2 case, causal small-stencil factorizations cannot always be attained:

[0336] Proposition 14.1.

[0337] There exists a 2×2 matrix over R[z] of determinant 1 which cannot be expressed as a product of causal small-stencil elementary matrices.

[0338] Proof.

[0339] Suppose a given non-constant 2×2 matrix A can be written as a product of causal small-stencil elementary matrices. A factor with an entry rz+s can be split up into a factor with entry rz and a factor with entry s. So A can be written as a product of constant matrices of determinant 1 and elementary matrices with upper right entry of the form rz. Express A as such a product with a minimal number of factors. (In this product, the two factor types must alternate, because two adjacent factors of the same type could be combined into one. Note that at least one rz factor must occur.)

[0340] The last factor in this product has at least one nonzero entry in its bottom row; select a column (column 1 or column 2) whose bottom entry in that last factor is nonzero. (If the last factor is an rz factor, column 2 will be selected.) Now multiply out this product of matrices from right to left. We will show by induction that, at each stage of this process (starting after the first rz matrix has been multiplied in), the two entries in the selected column of the partial product will have degrees differing by at most 1; in fact, if the last matrix multiplied in was an rz matrix, then the upper entry in the selected column will have degree 1 more than the degree of the lower entry.

[0341] The partial product just before the first rz matrix is multiplied in is constant, and its selected column has nonzero lower entry. Hence, after the rz matrix is multiplied in, the upper entry in the selected column will have degree 1 and the lower entry will have degree 0.

[0342] Suppose that (after multiplying by an rz matrix) the selected column in the current product has upper entry of degree d and lower entry of degree d−1. Then, after multiplying by a constant matrix of nonzero determinant, one of the two entries will have degree d and the other will have degree d−1 or d. The only way in which the lower entry will still have degree d−1 is if the lower left entry of the constant matrix is 0.

[0343] Now suppose we have just multiplied in a constant matrix of determinant 1, and are about to multiply in an rz matrix (not the first), and the selected column has entries of degrees differing by at most 1. Say the larger of the two degrees is d. The constant matrix just multiplied in cannot have lower left entry 0, because if it did we would have three consecutive factors of the form $\begin{pmatrix}1 & rz \\ 0 & 1\end{pmatrix}\begin{pmatrix}\alpha & s \\ 0 & \alpha^{-1}\end{pmatrix}\begin{pmatrix}1 & r^{\prime}z \\ 0 & 1\end{pmatrix},$

[0344] and these could be replaced by the two factors $\begin{pmatrix}1 & \left(r+r^{\prime}\alpha^{2}\right)z \\ 0 & 1\end{pmatrix}\begin{pmatrix}\alpha & s \\ 0 & \alpha^{-1}\end{pmatrix},$

[0345] contradicting the minimality of the factorization. So the lower entry of the selected column currently has degree d, while the upper entry has degree d−1 or d. After multiplying by the new rz matrix, the degree of the upper entry of the selected column will be d+1 and the degree of the lower entry will be d. This completes the induction.

[0346] In particular, the selected column of the final product A will have entries with degrees differing by at most 1. Now, the matrix $\begin{pmatrix}1 & z^{2} \\ 0 & 1\end{pmatrix}$

[0347] is a 2×2 matrix of determinant 1 which has no column whose entries have degrees differing by at most 1. Therefore, this matrix cannot be the matrix A above; in other words, this matrix cannot be factored into causal small-stencil elementary factors.

[0348] On the other hand, the presence of a third band yields enough extra freedom to allow causal small-stencil factorizations:

[0349] Proposition 14.2.

[0350] If n>2, then every n×n matrix over R[z] of determinant 1 can be expressed as a product of causal small-stencil elementary matrices.

[0351] (This is true even under the strictest definition of “small-stencil.”)

[0352] Proof.

[0353] We already know that such a matrix can be written as a product of causal elementary matrices. By (12.1), these elementary matrices can be factored into monomial elementary matrices, where the nonzero off-diagonal entry has the form cz^(k) for some non-negative integer k. So it suffices to show that such a monomial elementary matrix can be written as a product of causal small-stencil elementary matrices.

[0354] If k=0, then we can do this using nearest-neighbor ‘swaps’ and one nearest-neighbor elementary matrix as in the section entitled “Small-Stencil Factorizations”; the resulting factors are all constant and hence causal. For k=1, note that the elementary matrix with upper right entry cz (i.e., adding cz times band n to band 1) is causal and small-stencil; by combining this with constant nearest-neighbor ‘swaps’ (moving the source band to band n and the destination band to band 1, without wrapping around), we can handle an elementary matrix which adds cz times band i to band j.

[0355] Once we know how to add cz^(k) times band i to band j for any i and j, we can add cz^(k+1) times band i to band j for any i and j as follows: pick a band j′ different from i and j, and do:

[0356] add z times band i to band j′;

[0357] add cz^(k) times band j′ to band j;

[0358] add −z times band i to band j′;

[0359] add −cz^(k) times band j′ to band j.

[0360] Therefore, by induction, we can handle any monomial elementary matrix.
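The four-step recursion can be checked directly on matrices: composing the four elementary operations yields exactly the elementary matrix that adds cz^(k+1) times band i to band j. A small illustrative verification for n=3 (the band indices and sample values are arbitrary choices, not from the text):

```python
# Illustrative check of the recursion for n = 3 bands, with the
# (arbitrary) choice i = 0, j = 1, j' = 2 and sample values of c, k, z.
import numpy as np

z, c, k = 0.9 + 0.1j, 2.5, 3

def elem(src, dst, mult, n=3):
    """Elementary matrix that adds mult times band src to band dst."""
    m = np.eye(n, dtype=complex)
    m[dst, src] = mult
    return m

# Operations are applied to the signal in the listed order, so the
# matrix product is written right-to-left.
product = (elem(2, 1, -c * z**k) @   # add -c*z**k times band j' to band j
           elem(0, 2, -z) @          # add -z times band i to band j'
           elem(2, 1, c * z**k) @    # add c*z**k times band j' to band j
           elem(0, 2, z))            # add z times band i to band j'
assert np.allclose(product, elem(0, 1, c * z**(k + 1)))
```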

[0361] What if we want a causal small-stencil factorization which also preserves a one-band constant signal? Let us first assume we are using the flexible version of “small-stencil.” The strong version of constant preservation where the constant signal must be held in band 1 only (i.e., matrices in the group G) cannot be achieved here, because all causal small-stencil elementary matrices in G have first column e₁, so any product of such matrices also has first column e₁.

[0362] However, if we put the constant signal in band n instead, then a factorization which is causal, small-stencil, and strongly constant-preserving can be attained. (It follows that, if the constant band is allowed to “roam,” then the factorization can be achieved no matter which band initially contains the constant signal.) To see this, note that, using permissible factors, we can add c(z−1) times band i to band j if j<i, we can add c times band i to band j if i<n, and we can ‘swap’ bands i and j if i, j<n. Next, we can add c(z−1) times band i to band j if i<j: if j<n, ‘swap’ bands i and j, add −c(z−1) times band j to band i, and ‘swap’ bands j and i; if j=n, find j′ different from i and j, add c(z−1) times band i to band j′, add band j′ to band n, subtract c(z−1) times band i from band j′, and subtract band j′ from band n. Hence, using the recursive method from Proposition 14.2, we can add c(z−1)^(k) times band i to band j for any distinct i and j, where k is any nonnegative integer if i<n and k is any positive integer if i=n.

[0363] Now, given a causal matrix which preserves a constant in band n, we can factor it into causal elementary matrices preserving a constant in band n by the methods described in the section entitled “Factors Which Preserve Constant Signals.” Such an elementary matrix adds p(z) times band i to band j, where the polynomial p(z) must be divisible by z−1 if i=n. To factor this matrix further, just expand p(z) in powers of z−1; each term in this expansion can be handled by the above, so, by (12.1), the whole elementary matrix can be factored into permissible factors.

[0364] For the strict version of “small-stencil,” we already know by the argument from the section entitled “Simultaneous Small-Stencil And Constant-Preserving Factors” that strong constant preservation cannot be achieved (this argument works no matter which band contains the constant). However, if we allow the constant to roam, then a causal small-stencil factorization is possible. For this, it suffices by the above to be able to add c(z−1) times the constant band to another band; this can be done by a version of the twelve-step method from the section entitled “Simultaneous Small-Stencil And Constant-Preserving Factors.” Specifically, this method does not “wrap around”; instead, it handles the z part by moving the constant band to band n and the intermediate destination band “j′” to band 1.

EXAMPLE: The 9-7 Wavelet

[0365] As an example of the methods presented here, we consider a 9-7 wavelet which has been found to be well-suited for image compression and is in common use.

[0366] The exact formulas for the filter coefficients for this wavelet are given in the FBI fingerprint compression standard. The coefficients are expressed in terms of x₁, where $x_{1}=\left(\frac{-14\sqrt{15}+63}{1080\sqrt{15}}\right)^{1/3}+\left(\frac{-14\sqrt{15}-63}{1080\sqrt{15}}\right)^{1/3}-\frac{1}{6}$

[0367] is the real root of the equation

$20x_{1}^{3}+10x_{1}^{2}+4x_{1}+1=0$

[0368] The referenced formulas also use a complex number x₂, but they can be expressed in terms of x₁ using the formulas

$|x_{2}|^{2}=x_{1}^{2}+\frac{x_{1}}{2}+\frac{1}{5}\quad\text{and}\quad\operatorname{Re}x_{2}=-\frac{x_{1}}{2}-\frac{1}{4},$

[0369] so x₂ is not needed. The filter coefficients then become: $\begin{aligned}h_{0}(0)&=-\sqrt{2}\,x_{1}\left(240x_{1}^{2}+160x_{1}+83\right)/32\approx 0.8526986790094034\\ h_{0}(\pm 1)&=-\sqrt{2}\,x_{1}\left(160x_{1}^{2}+90x_{1}+37\right)/32\approx 0.3774028556126538\\ h_{0}(\pm 2)&=-\sqrt{2}\,x_{1}\left(10x_{1}^{2}-3\right)/8\approx -0.1106244044184234\\ h_{0}(\pm 3)&=5\sqrt{2}\,x_{1}\left(2x_{1}+1\right)/32\approx -0.0238494650193800\\ h_{0}(\pm 4)&=-5\sqrt{2}\,x_{1}/64\approx 0.0378284555069955\\ h_{1}(-1)&=\sqrt{2}\,\left(6x_{1}-1\right)/\left(16x_{1}\right)\approx 0.7884856164056644\\ h_{1}(-2)&=h_{1}(0)=-\sqrt{2}\,\left(16x_{1}-1\right)/\left(64x_{1}\right)\approx -0.4180922432222122\\ h_{1}(-3)&=h_{1}(1)=-\sqrt{2}\,\left(2x_{1}+1\right)/\left(32x_{1}\right)\approx -0.0406894176095584\\ h_{1}(-4)&=h_{1}(2)=-\sqrt{2}/\left(64x_{1}\right)\approx 0.0645388826289384\end{aligned}$
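These closed-form expressions are easy to reproduce numerically. The following sketch (illustrative only; the root-selection logic is an implementation assumption) computes x₁ from its cubic and checks two of the quoted coefficient values:

```python
# Illustrative reproduction of x1 and two of the coefficients quoted above.
import numpy as np

roots = np.roots([20, 10, 4, 1])
x1 = min(roots, key=lambda r: abs(r.imag)).real   # the unique real root

h0_0 = -np.sqrt(2) * x1 * (240 * x1**2 + 160 * x1 + 83) / 32
h1_m4 = -np.sqrt(2) / (64 * x1)
assert abs(h0_0 - 0.8526986790094034) < 1e-10
assert abs(h1_m4 - 0.0645388826289384) < 1e-10
```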

[0370] The z matrix specified by these filter coefficients is $M(z)=\begin{pmatrix}a_{11}(z) & a_{12}(z) \\ a_{21}(z) & a_{22}(z)\end{pmatrix},$

[0371] where $\begin{aligned}a_{11}(z)&=h_{0}(-4)z^{-2}+h_{0}(-2)z^{-1}+h_{0}(0)+h_{0}(2)z+h_{0}(4)z^{2}\\ a_{12}(z)&=h_{0}(-3)z^{-1}+h_{0}(-1)+h_{0}(1)z+h_{0}(3)z^{2}\\ a_{21}(z)&=h_{1}(-4)z^{-2}+h_{1}(-2)z^{-1}+h_{1}(0)+h_{1}(2)z\\ a_{22}(z)&=h_{1}(-3)z^{-1}+h_{1}(-1)+h_{1}(1)z\end{aligned}$

[0372] It is already known to those of ordinary skill in the art how to factor M(z) into four elementary matrices and a constant diagonal matrix. In fact, these factors have the same symmetry as the matrix itself. The factors are also small-stencil; however, the integer approximation (of course, one has to factor the constant diagonal matrix further to get the integer approximation) does not preserve the constant signal. This is inevitable using symmetric factors, because requiring symmetry makes the factorization essentially unique. We will see that the use of asymmetric factors gives the extra freedom necessary for constant preservation while still using small-stencil factors. Since the given matrix is not causal, we do not need to look for causal factors.

[0373] The determinant of M(z) is 1. However, M(z) does not send a constant signal with value k to a constant value k on band 1 (the low-pass filter) and zero on band 2 (the high-pass filter); it sends this constant signal to a constant √2·k on band 1 and zero on band 2. We therefore pull out a constant diagonal scaling matrix factor $S=\begin{pmatrix}\sqrt{2} & 0 \\ 0 & 1/\sqrt{2}\end{pmatrix}$

[0374] and work with the matrix S⁻¹M(z) from now on; for applications such as compression this scaling factor makes little difference anyway and is less important than constant preservation.

[0375] Next, as specified earlier, we pull out a factor $\Delta=\begin{pmatrix}1 & 0 \\ -1 & 1\end{pmatrix}$

[0376] from the right, leaving a matrix A(z)=S⁻¹M(z)Δ⁻¹ satisfying A(1)e₁=e₁. We will now work out a small-stencil constant-preserving factorization for A(z) (more efficient than the one described earlier in the section entitled “Simultaneous Small-Stencil and Constant-Preserving Factors”).

[0377] We start eliminating in the first column. First we do a row operation to eliminate the z²-term from a₁₁. (Note that this must also eliminate the z²-term from a₁₂, because otherwise the determinant of the new matrix would have a z³-term.) We have an extra degree of freedom here, so we can eliminate the z⁻²-term from a₁₁ (and the z⁻¹-term from a₁₂) at the same time; this is what the usual symmetric factorization process does. Next, we do a row operation to eliminate the z⁻²-term from a₂₁; here there is no extra degree of freedom, and we have to break symmetry to maintain constant preservation. The third step eliminates one term from a₁₁, and this must be the trailing term (the z⁻¹-term) in order to make later factors small-stencil. The fourth operation eliminates the z-term from a₂₁, and the fifth operation eliminates the z-term from a₁₁. The remaining a₁₁ is a constant, and since the matrix is in the group G, the constant must be 1. In fact, we find that the remaining matrix is elementary (unit lower triangular) and small-stencil. This remaining factor can be combined with the factor Δ, which is also unit lower triangular. This yields the factorization:

$S^{-1}M(z)=U_{1}(z)L_{2}(z)U_{3}(z)L_{4}(z)U_{5}(z)L_{6}(z)$

[0378] where U_(i)(z) is small-stencil unit upper triangular and L_(i)(z) is small-stencil lower triangular: $U_{i}(z)=\begin{pmatrix}1 & r_{i}z+s_{i} \\ 0 & 1\end{pmatrix},\quad L_{i}(z)=\begin{pmatrix}1 & 0 \\ r_{i}z^{-1}+s_{i} & 1\end{pmatrix}.$

[0379] The coefficients in these factor matrices are: $\begin{aligned}r_{1}&=5x_{1}^{2}/2, & s_{1}&=r_{1}\\ r_{2}&=\left(20x_{1}^{2}+3\right)/4, & s_{2}&=-r_{2}\\ r_{3}&=0, & s_{3}&=\left(-410x_{1}^{2}-90x_{1}+13\right)/110\\ r_{4}&=-\left(40x_{1}^{2}+5\right)/4, & s_{4}&=-r_{4}\\ r_{5}&=\left(-70x_{1}^{2}+45x_{1}+21\right)/55, & s_{5}&=0\\ r_{6}&=5x_{1}^{2}, & s_{6}&=-r_{6}-1\end{aligned}$

[0380] The numerical values of these coefficients are: $\begin{aligned}r_{1}&\approx 0.2930671710299618, & s_{1}&\approx 0.2930671710299618\\ r_{2}&\approx 1.3361343420599236, & s_{2}&\approx -1.3361343420599236\\ r_{3}&\approx 0, & s_{3}&\approx -0.0386222501060046\\ r_{4}&\approx -2.4222686841198471, & s_{4}&\approx 2.4222686841198471\\ r_{5}&\approx -0.0475120919539189, & s_{5}&\approx 0\\ r_{6}&\approx 0.5861343420599236, & s_{6}&\approx -1.5861343420599236\end{aligned}$

[0381] Note that, for each i≦6, there is a simple relation between r_(i) and s_(i) (or one of them is 0). This means that, in each case, the rounded value ⟨r_(i)a+s_(i)b⟩ for integer arguments a and b can be computed by integer additions or subtractions together with a single operation of the form c↦⟨rc⟩, where c is an integer and r is r_(i) or s_(i). If the latter operation can be performed by lookup in a precomputed table, then floating-point arithmetic can be avoided altogether.
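The following Python sketch illustrates the resulting integer-to-integer transform. It assumes periodic boundary handling and the convention stated earlier that a factor of z refers to the previous block; with those assumptions, each lifting step adds a rounded combination of the other (unmodified) band, so the inverse simply subtracts the same rounded quantity and reconstruction is exact:

```python
# Illustrative integer-to-integer lifting sketch.  Assumptions (not part
# of the text): numpy, periodic boundaries, z = previous block.
import numpy as np

def lift_up(x1, x2, r, s, sign=1):
    # U_i (inverse for sign=-1): band 1 += <r*z*band2 + s*band2>
    return x1 + sign * np.round(r * np.roll(x2, 1) + s * x2).astype(np.int64), x2

def lift_down(x1, x2, r, s, sign=1):
    # L_i (inverse for sign=-1): band 2 += <r*z**-1*band1 + s*band1>
    return x1, x2 + sign * np.round(r * np.roll(x1, -1) + s * x1).astype(np.int64)

# (r_i, s_i) for U1, L2, U3, L4, U5, L6 (numerical values given above).
coeffs = [(0.2930671710299618, 0.2930671710299618),
          (1.3361343420599236, -1.3361343420599236),
          (0.0, -0.0386222501060046),
          (-2.4222686841198471, 2.4222686841198471),
          (-0.0475120919539189, 0.0),
          (0.5861343420599236, -1.5861343420599236)]
steps = [lift_up, lift_down, lift_up, lift_down, lift_up, lift_down]

rng = np.random.default_rng(1)
x1 = rng.integers(-2**16, 2**16, 8)
x2 = rng.integers(-2**16, 2**16, 8)

# Forward transform: apply L6 first, then U5, ..., finishing with U1.
y1, y2 = x1, x2
for (r, s), step in list(zip(coeffs, steps))[::-1]:
    y1, y2 = step(y1, y2, r, s)

# Inverse transform: undo U1 first, then L2, ..., finishing with L6.
for (r, s), step in zip(coeffs, steps):
    y1, y2 = step(y1, y2, r, s, sign=-1)

assert np.array_equal(y1, x1) and np.array_equal(y2, x2)   # exact
```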

[0382] Since this factorization is not symmetric, it has a mirror-image form which can be obtained by reversing the signal, applying the factorization as above, and reversing again. To do this algebraically, we replace z with z⁻¹ and conjugate by $\begin{pmatrix}1 & 0 \\ 0 & z\end{pmatrix};$

[0383] note that this leaves M(z) unchanged. The effect on the factors is to simply interchange r_(i) with s_(i) for each i.

[0384] We now perform some error analysis for this factorization, starting with the norm method.

[0385] Computing the norm of an arbitrary z-transform matrix appears to be a messy nonlinear optimization problem, but for an elementary matrix it is feasible. Let p(z) be the nonzero off-diagonal entry of a 2×2 elementary matrix B. Let b be the absolute value of the constant term of p(z), and let a be the sum of the absolute values of the coefficients of the nonconstant terms of p(z). Then ∥B∥ is the maximum value of

$\sqrt{x^{2}+(y+bx+a)^{2}}$

[0386] for real numbers x and y such that x²+y²=1.

[0387] In fact, the same formula works for the norm of an n×n elementary matrix with nonzero off-diagonal entry p(z). Here we need to maximize

$\sqrt{x^{2}+(y+bx+a)^{2}+x_{3}^{2}+x_{4}^{2}+\cdots+x_{n}^{2}}$

[0388] subject to the constraint x²+y²+x₃²+⋯+x_(n)²=1. This is equivalent to maximizing (x²+(y+bx+a)²+x₃²+⋯+x_(n)²)−(x²+y²+x₃²+⋯+x_(n)²)=2y(bx+a)+(bx+a)² under this same constraint. If we hold x fixed, then the new objective function is linear in y, so it is maximized at one of the two extreme values of y; these extreme values occur when x₃=⋯=x_(n)=0. This reduces the problem to the 2×2 case.

[0389] Actually computing this maximum requires the solution of a quartic polynomial equation in general, so one will normally resort to numerical approximation. But there are some special cases where the answer is simpler:

[0390] if b=0, then ∥B∥=a+1; if a=0, then $\|B\|=\frac{b+\sqrt{b^{2}+4}}{2}$; if a=b, then $\|B\|=\sqrt{\frac{2\left(a^{4}+5a^{2}+2+a\left(a^{2}+3\right)^{3/2}\right)}{a^{2}+4}}.$

[0391] For the matrices U_(i)(z) and L_(i)(z), we have a=|r_(i)| and b=|s_(i)|. Five of these six matrices fall under the special cases above; we handle the remaining matrix L₆(z) by numerical methods. The resulting matrix norms are: $\begin{aligned}\|U_{1}(z)\|&\approx 1.4037242452456795\\ \|L_{2}(z)\|&\approx 3.1167781378219282\\ \|U_{3}(z)\|&\approx 1.0194975674480953\\ \|L_{4}(z)\|&\approx 5.1277219075789451\\ \|U_{5}(z)\|&\approx 1.0475120919539189\\ \|L_{6}(z)\|&\approx 2.6101749209829466\end{aligned}$
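These norms can also be reproduced by brute force: parametrize the unit circle and maximize √(x²+(y+bx+a)²) directly. The sketch below is a numerical illustration (not the analytic special-case method above) that recovers all six values:

```python
# Illustrative brute-force evaluation of the elementary-matrix norms by
# dense sampling of the unit circle; here a = |r_i| and b = |s_i|.
import numpy as np

def elementary_norm(a, b, samples=1_000_000):
    t = np.linspace(0.0, 2.0 * np.pi, samples)
    x, y = np.cos(t), np.sin(t)
    return np.sqrt(x**2 + (y + b * x + a)**2).max()

cases = [(0.2930671710299618, 0.2930671710299618, 1.4037242452456795),  # U1
         (1.3361343420599236, 1.3361343420599236, 3.1167781378219282),  # L2
         (0.0,                0.0386222501060046, 1.0194975674480953),  # U3
         (2.4222686841198471, 2.4222686841198471, 5.1277219075789451),  # L4
         (0.0475120919539189, 0.0,                1.0475120919539189),  # U5
         (0.5861343420599236, 1.5861343420599236, 2.6101749209829466)]  # L6
for a, b, expected in cases:
    assert abs(elementary_norm(a, b) - expected) < 1e-6
```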

[0392] Now we can use (2.1) to compute error bounds: for the forward transform the error bound is about 29.0346469116757969; for the inverse transform (which can also be computed using norms because we are in the 2×2 case) we get a bound of 39.6038983737180800.

[0393] These bounds can probably be improved by direct error analysis in the manner discussed earlier, but this would require analyzing a combination of 17 separate errors (which are probably not independent). Instead we go to empirical methods. A random sample of over 4.6×10⁹ test cases (using random integers chosen uniformly from the interval [−2¹⁶, 2¹⁶−1]) yielded the following worst errors:

[0394] Forward transform: error ≈ 4.1636582346765949 for input $\begin{pmatrix}-2522 \\ -16164\end{pmatrix},\begin{pmatrix}-6636 \\ 658\end{pmatrix},\begin{pmatrix}-3046 \\ -14296\end{pmatrix},\begin{pmatrix}6398 \\ 10921\end{pmatrix},\begin{pmatrix}-6254 \\ 8138\end{pmatrix}$

[0395] Inverse transform: error ≈ 4.0303761353834788 for input $\begin{pmatrix}757 \\ 10905\end{pmatrix},\begin{pmatrix}-15135 \\ 11419\end{pmatrix},\begin{pmatrix}-11480 \\ 511\end{pmatrix},\begin{pmatrix}6895 \\ -1806\end{pmatrix},\begin{pmatrix}-10013 \\ 11732\end{pmatrix}$

[0396] (One needs five successive input pairs of low-band, high-band entries to compute one output pair.)

[0397] One might expect the alternate mirror-image form of the factorization to have the same error bounds. However, the reflection also changes the pairing between the band-1 entries and the band-2 entries. (When the input signal is split into length-2 vectors, each band-1 entry is paired with the entry that immediately follows it in band 2. After the reflection, these two entries will end up in separate vectors.) So there is no reason to expect the error bounds to be identical. In fact, testing yields inputs with errors slightly worse than those found for the unreflected factorization:

[0398] Forward transform: error ≈ 4.2264010122204445 for input $\begin{pmatrix}12962 \\ 12976\end{pmatrix},\begin{pmatrix}-15095 \\ -13917\end{pmatrix},\begin{pmatrix}-4271 \\ -3962\end{pmatrix},\begin{pmatrix}12318 \\ 6625\end{pmatrix},\begin{pmatrix}-13212 \\ -5853\end{pmatrix}$

[0399] Inverse transform: error ≈ 4.1588504091004767 for input $\begin{pmatrix}-4703 \\ -8068\end{pmatrix},\begin{pmatrix}-12506 \\ -7893\end{pmatrix},\begin{pmatrix}13822 \\ -6129\end{pmatrix},\begin{pmatrix}3251 \\ -14093\end{pmatrix},\begin{pmatrix}-14943 \\ -5253\end{pmatrix}$

[0400] Of course, there are many possible factorizations other than these two; finding an optimal one appears to be quite a challenge.

Using IIR Polynomial Division

[0401] As we have noted before, a factorization of a z-transform matrix into elementary factors may be much longer than a factorization of a constant matrix; in fact, there is no fixed bound on the length of the z-transform factorization. The main reason for this is that the entries in the elementary matrices are quotients. We can divide by an arbitrary nonzero number, but we cannot divide by an arbitrary nonzero polynomial because only Laurent polynomials are allowed as entries in factor matrices. We will now look at what happens if we relax this restriction.

[0402] If $A = \begin{pmatrix}a_{11} & a_{12} \\a_{21} & a_{22}\end{pmatrix}$

[0403] has determinant 1 and we can divide by a₂₁, then we can factor A into three elementary matrices as described in the section entitled “Preserving Particular Lattice Points”: $A=\begin{pmatrix}1 & \frac{a_{11}-1}{a_{21}} \\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & 0 \\ a_{21} & 1\end{pmatrix}\begin{pmatrix}1 & \frac{a_{22}-1}{a_{21}} \\ 0 & 1\end{pmatrix}.$
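Because this is an exact identity whenever det A=1 and a₂₁≠0, it is easy to confirm on a small numeric example (the matrix below is an arbitrary illustrative choice):

```python
# Illustrative numerical confirmation on an arbitrary matrix with
# det A = 1 and a21 != 0.
import numpy as np

a11, a12, a21, a22 = 3.0, 4.0, 2.0, 3.0          # det = 3*3 - 4*2 = 1
A = np.array([[a11, a12], [a21, a22]])
F = (np.array([[1.0, (a11 - 1) / a21], [0.0, 1.0]]) @
     np.array([[1.0, 0.0], [a21, 1.0]]) @
     np.array([[1.0, (a22 - 1) / a21], [0.0, 1.0]]))
assert np.allclose(A, F)
```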

[0404] So when can we divide by a₂₁? Clearly we can if a₂₁ is a nonzero constant or monomial. In other cases the process will be IIR (infinite impulse response) rather than FIR (finite impulse response).

[0405] For instance, suppose a₂₁(z)=1−cz. Then $\frac{1}{a_{21}}=1+cz+c^{2}z^{2}+\cdots,$

[0406] so adding, say, $\frac{1}{a_{21}}$ times the second band to the first band would involve combining entries from arbitrarily far in the past from the second band to update the current entry in the first band.

[0408] This also raises the issue of numerical stability. If |c|>1, then we are adding larger and larger multiples of older entries from the second band to the first band, and the process is quite unstable and leads to an unbounded result. If |c|<1, though, then the process is stable and the result of applying it to a bounded signal is still a bounded signal.

[0409] If p(z) is a nonzero Laurent polynomial whose zeros (other than 0) all have absolute value greater than 1, then p can be written as a product of a monomial and some number of factors of the form 1−cz with |c|<1. Since we can divide by each of these factors stably in turn, we can divide by p stably.

[0410] In fact, the process of dividing by p is just the standard long-division algorithm for (Laurent) polynomials, starting at the low-degree end. There is no need to factor p into linear factors and divide by them separately; the result is the same if one just performs the long division by p directly. This also means that one does not have to remember the entire past signal during the long division process; one only has to remember the current partial remainder, which is no longer than p.
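For a single factor 1−cz with |c|<1, this long division from the low-degree end is just the recursion y[n]=x[n]+c·y[n−1]. A minimal sketch, assuming z denotes a one-step delay and a signal that starts at rest:

```python
# Minimal sketch of stable IIR division by p(z) = 1 - c*z with |c| < 1.
def divide_by_1_minus_cz(x, c):
    y, prev = [], 0.0
    for xn in x:
        prev = xn + c * prev      # the carried remainder is just prev
        y.append(prev)
    return y

# Impulse response recovers the expansion 1/(1-c*z) = 1 + c*z + c**2*z**2 + ...
print(divide_by_1_minus_cz([1.0, 0.0, 0.0, 0.0], 0.5))  # [1.0, 0.5, 0.25, 0.125]
```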

[0411] Let us now return to the polynomial 1−cz, and assume this time that |c|>1, so we cannot simply perform the division as above. What we can do instead is rewrite 1−cz as −cz(1−c⁻¹z⁻¹). The monomial −cz causes no difficulty, and, since |c|>1, the expression $\frac{1}{1-c^{-1}z^{-1}}=1+c^{-1}z^{-1}+c^{-2}z^{-2}+\cdots$

[0412] has decreasing coefficients and leads to a stable division algorithm.

[0413] This corresponds to simply doing the long division in the opposite direction, starting from the high end (which is what one commonly does with polynomial division anyway). Again one can handle multiple factors of this form with a single long division. Of course, we are giving up on causality here.

[0414] In general, suppose a₂₁ is a Laurent polynomial which does not have any (complex) zeros of absolute value 1. Then we can factor a₂₁ into two parts, one having the zeros of absolute value greater than 1 and the other having the zeros of absolute value less than 1. This lets us express a₂₁ in the form m(z)p(z)q(z⁻¹), where m(z) is a monomial and p and q are polynomials (ordinary, not Laurent) with constant term 1 whose zeros are all of absolute value greater than 1. Dividing by m(z) is easy; we divide by p(z) and q(z⁻¹) successively using IIR long division, with the first division proceeding from low to high degree and the second division proceeding from high to low degree.

[0415] There are some special cases of interest. If a₂₁ is a symmetric Laurent polynomial (i.e., a₂₁(z)=a₂₁(z⁻¹)) which has no zeros of absolute value 1, then the polynomials p and q are actually equal; we can absorb the monomial into the other factors and write a₂₁ in the form p(z)p(z⁻¹) for some polynomial p (although this will require complex coefficients if a₂₁(1)<0). Another common situation is for the Laurent polynomial a₂₁(z) to have one dominant coefficient whose absolute value is greater than the sum of the absolute values of all of the other coefficients. In this case, it is impossible for a₂₁ to have a complex zero of absolute value 1, so we definitely can factor a₂₁ as above for division purposes.

[0416] Finally, what if a₂₁ has a complex zero of absolute value 1? The simplest case is a₂₁=z−1. If a constant signal x is multiplied by a₂₁, the result will be zero no matter what the constant is. Hence, given a₂₁x, one cannot recover x; in other words, division by a₂₁ is not even well-defined. The same thing happens for any polynomial a₂₁ with a zero w such that |w|=1: there is a nonzero bounded signal x=(…, w², w, 1, w⁻¹, w⁻², …) such that a₂₁x is zero, so it is impossible to divide by a₂₁. (If a₂₁ has real coefficients but w is not real, then one can take the real part of the signal x above to get a real nonzero signal annihilated by a₂₁.)

[0417] We have shown here that, in many cases, it is possible to use IIR integer-approximable factors for a FIR linear transformation, and that this may require fewer factors than a FIR factorization (thus possibly giving faster computation and lower error bounds). The main drawback of this method is that it requires processing the entire signal even if one only needs part of the output (say, a subinterval of the transformed signal). The need for this is frequent enough that it is usually worthwhile to use FIR factors even when more of them are required.

Conclusion

[0418] We have shown that a z-transform matrix (for a perfect reconstruction FIR signal transformation) of determinant 1 can be factored into elementary matrices (liftings) in a variety of ways; this allows us to find integer approximations to these factors (and hence to the original transformation) with additional useful properties, such as locality of interaction, causality, and/or preservation of constant integer signals.

[0419] Just as in the fixed-length case, there are a number of possibilities here that remain to be explored, including additional factorizations of matrices and improved error analysis. Also, additional study would be helpful for the case of more than two bands (as mentioned earlier, one can use unit triangular matrices instead of elementary matrices and thus reduce the number of factors; algorithms for producing efficient factorizations into such factors would be quite useful) and for multidimensional signals (where one cannot always factor a transformation into elementary matrices, and even when one can, such as when there are more than two bands, the number of such matrices may be excessive).

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
 1. A method for generating a first plurality of output data values by transforming a plurality of input data values using a computer, the first plurality of output data values approximating a second plurality of output data values, the second plurality of output data values generated by applying a linear transform to the plurality of input data values, the method comprising at least one step, the step being one of the following: rearranging at least one data value in a plurality of current input data values; negating at least one data value in the plurality of current input data values; modifying at least one data value in the plurality of current input data values, each modified data value generated by applying a linear combination of unmodified values in the plurality of input data values to the at least one data value, the linear combination comprised of an integer generated in a reproducible manner, the integer being from one of a group consisting of a rounded integer and a converted integer; and a step which is equivalent to a successive combination of one or more steps of the preceding three types.
 2. The method of claim 1 wherein the first plurality of output data values are integers if the plurality of input data values are integers.
 3. The method of claim 2 wherein the plurality of input data values can be reconstructed exactly from the first plurality of output data values.
 4. The method of claim 1 wherein the linear transform has a determinant, the determinant being invertible as one of a group consisting of an integer and an integer Laurent polynomial.
 5. The method of claim 1 wherein the linear transform has a determinant, the determinant being invertible as one of a group consisting of a real number and a real Laurent polynomial, and the method further comprising rescaling at least one of a plurality of bands in the linear transform.
 6. The method of claim 1 wherein the linear transform is a fixed finite-dimensional linear transform.
 7. The method of claim 6 wherein the linear transform has a property that when applied to the plurality of input data values, the plurality of data values being zero except at one location, the second plurality of output data values generated by applying the linear transform are identical to the plurality of input data values, and the method having the same property.
 8. The method of claim 6 wherein the plurality of input data values includes an input integer plurality and the second plurality of output data values includes an output integer plurality, the linear transform mapping an integer multiple of the input integer plurality to an integer multiple of the integer output plurality, the integer multiple of the input integer plurality corresponding to the integer multiple of the integer output plurality, and the method mapping the integer multiple of the integer input plurality to the corresponding integer multiple of the integer output plurality.
 9. The method of claim 6 wherein the linear transform is one of a plurality of RGB-to-YCbCr color transforms.
 10. The method of claim 6 wherein the linear transform is a RGB-to-YIQ color transform.
 11. The method of claim 6 wherein the step of modifying the at least one of the plurality of input data values comprises: successively sweeping through a plurality of bands of input data values in a first direction; successively adding to each band during each successive sweep in the first direction the linear combination of unmodified values in the plurality of input data values, the linear combination being a rounded linear combination of the plurality of input data values in preceding bands; successively sweeping through a plurality of bands in a second direction, the second direction being opposite to the first direction; successively adding to each band during each successive sweep in the second direction the linear combination of unmodified values in the plurality of input data values, the linear combination being a rounded linear combination of the plurality of input data values in preceding bands; and adding to one of the bands the linear combination of unmodified values in the plurality of input data values, the linear combination being a rounded linear combination of the plurality of input data values in all remaining bands of the generated matrix.
 12. The method of claim 11 wherein the step of rearranging at least one of the plurality of input data values comprises permuting a plurality of bands, the plurality of bands including the plurality of input data values, and wherein the step of modifying the at least one of the plurality of input data values further includes permuting the plurality of bands after adding to one of the bands.
 13. The method of claim 1 wherein the linear transform is a wavelet transform.
 14. The method of claim 13 wherein the linear transform has a property that when applied to the plurality of input data values, the plurality of data values being zero except at one location, the second plurality of output data values generated by applying the linear transform are identical to the plurality of input data values, and the method having the same property.
 15. The method of claim 13 wherein the plurality of input data values includes an input integer plurality and the second plurality of output data values includes an output integer plurality, the linear transform mapping an integer multiple of the input integer plurality to an integer multiple of the integer output plurality, the integer multiple of the input integer plurality corresponding to the integer multiple of the integer output plurality, and the method mapping the integer multiple of the integer input plurality to the corresponding integer multiple of the integer output plurality.
 16. The method of claim 13 wherein the step of rearranging the at least one data value is performed on only adjacent data values in the plurality of input data values.
 17. The method of claim 13 wherein the step of modifying the at least one data value is performed on only adjacent data values in the plurality of input data values.
 18. The method of claim 13 wherein the wavelet transform is a 9-7 wavelet transform.
 19. A method of generating matrix factors for a finite-dimensional linear transform using a computer, each matrix factor represented by a symbol, the linear transform represented by data values stored in a linear transformation matrix having a nonzero determinant, the method comprising: applying a first LU-decomposition to the linear transformation matrix; generating four matrices from the LU-decomposition of the linear transformation matrix, the four matrices represented by the symbols Π̃, Π̃₂, L̃ and Ũ and satisfying the relationship Π̃AΠ̃₂=L̃Ũ, the symbol Π̃ representing a first permutation matrix, the symbol Π̃₂ representing a second permutation matrix, the symbol L̃ representing a lower triangular matrix having a unit diagonal, and the symbol Ũ representing a first upper triangular matrix; generating a third matrix represented by the symbol Â from the linear transformation matrix A, the third matrix having a plurality of rows and a determinant of 1; computing a signed permutation matrix Π from the linear transformation matrix and the third matrix such that A=ΠÂ; generating a permuted linear transformation matrix represented by the symbol A′ from the linear transformation matrix, the permuted linear transformation matrix having a determinant of 1; computing a second upper triangular matrix represented by the symbol U₁ from the permuted linear transformation matrix and the third matrix, the second upper triangular matrix having a plurality of rows, all diagonal entries equal to 1, and all entries below a first row equal to 0, the second upper triangular matrix satisfying the relationship Â=U₁A′; factoring the permuted linear transformation matrix into a product including a lower triangular matrix and an upper triangular matrix, the lower triangular matrix and the upper triangular matrix each having a unit diagonal, the lower triangular matrix represented by the symbol L and the upper triangular matrix represented by the symbol U; and generating the matrix factors for the scaled linear transformation matrix A, the matrix factors including at least the lower triangular matrix L, the upper triangular matrix U, the second upper triangular matrix U₁, and the signed permutation matrix Π, the linear transformation matrix expressed as a product of the matrix factors.
 20. The method of claim 19 further comprising removing a scaling factor from the linear transformation matrix before applying the first LU-decomposition, the scaled linear transformation matrix having a determinant of 1 and represented by the symbol A.
 21. The method of claim 19 wherein the linear transformation matrix has a determinant equal to one of the group consisting of 1 and −1, the scaling factor is not removed from the linear transformation matrix, and the matrix factors further include a scaling matrix.
 22. The method of claim 19 wherein the scaled linear transformation matrix is an identity matrix if each of the matrix factors is an identity matrix.
 23. The method of claim 19 further including: generating a matrix comprised of a plurality of input data values stored in a plurality of bands, the plurality of input data values including a plurality of nonzero data values on one of the plurality of bands in the generated matrix, the linear transform not modifying the plurality of nonzero data values in the one band when applied to the plurality of input data values; and generating a modified linear transformation matrix from the matrix factors generated for the scaled linear transformation matrix, the modified linear transformation matrix storing a modified linear transform, the modified linear transform matching the linear transform when the plurality of nonzero data values in the one band are integer values.
24. The method of claim 19 further including: generating a matrix comprised of a plurality of input data values, the plurality of input data values including a one-dimensional range of integer input data values; and generating a modified linear transformation matrix from at least the matrix factors generated for the scaled linear transformation matrix, the modified linear transformation matrix storing a modified linear transform.

25. The method of claim 24 wherein: the linear transform maps the one-dimensional range of integer input data values to a plurality of integer output data values; and the modified linear transform maps the one-dimensional range of integer input data values to the plurality of integer output data values.
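Claims 24 and 25 turn on a single property: wherever the exact linear step already produces an integer, the rounded (modified) step must produce the same integer. A minimal sketch of one unit-triangular factor applied as a rounded lifting step (the coefficient 0.75 and Python's round are illustrative choices, not values from the specification):

    # One unit-triangular factor [[1, c], [0, 1]] applied as a lifting step.
    # The coefficient and the rounding rule are illustrative; any fixed
    # rounding works as long as the inverse reuses it.
    c = 0.75

    def step_exact(x0, x1):
        return x0 + c * x1, x1                 # the exact linear step

    def step_modified(x0, x1):
        return x0 + round(c * x1), x1          # integer-to-integer version

    def step_modified_inverse(y0, y1):
        return y0 - round(c * y1), y1          # exact inverse on integers

    for x1 in range(-8, 9):
        x0 = 3
        y = step_modified(x0, x1)
        assert step_modified_inverse(*y) == (x0, x1)   # lossless on integers
        if c * x1 == int(c * x1):              # exact output already integral
            assert y == step_exact(x0, x1)     # modified agrees with exact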
26. A method of generating a sequence of matrix factors for a transformation matrix having a plurality of rows and columns using a computer, the transformation matrix storing data values representing a wavelet transform, the method comprising: applying at least one plurality of row reduction operations to the transformation matrix; and generating the sequence of matrix factors from the reduced transformation matrix and the row reduction operations.
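A hypothetical miniature of the row-reduction factoring recited in claim 26 (the 2×2 matrix is an arbitrary determinant-1 example, not a wavelet filter from the specification): each unit row operation is an invertible elementary matrix, and inverting the operations in order yields the factor sequence.

    import numpy as np

    def unit_row_op(n, dst, src, coeff):
        # Elementary matrix adding coeff times row `src` to row `dst`.
        E = np.eye(n)
        E[dst, src] = coeff
        return E

    A = np.array([[2.0, 1.0],                  # arbitrary determinant-1 example
                  [1.0, 1.0]])

    ops = []                                   # row reductions, in applied order
    R = A.copy()
    for dst, src, coeff in [(0, 1, -1.0),      # row0 -= row1
                            (1, 0, -1.0)]:     # row1 -= row0 -> identity
        E = unit_row_op(2, dst, src, coeff)
        R = E @ R
        ops.append(E)
    assert np.allclose(R, np.eye(2))

    # If E2 @ E1 @ A = I, then A = inv(E1) @ inv(E2): the factor sequence
    # is the operation inverses, each itself a unit triangular matrix.
    factors = [np.linalg.inv(E) for E in ops]
    assert np.allclose(factors[0] @ factors[1], A)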
27. The method of claim 26 further includes removing a scaling factor from the transformation matrix before applying the row reduction operations, the scaled transformation matrix having a determinant with a coefficient equal to one of the group consisting of 1 and −1.
28. The method of claim 26 wherein the sequence of matrix factors includes a first diagonal matrix generated by extracting a permutation matrix from the reduced transformation matrix.
29. The method of claim 28 wherein the first diagonal matrix has a nonzero determinant that is other than a monomial with coefficient in the group consisting of 1 and −1, the first diagonal matrix being factored into a plurality of elementary matrices and a second diagonal matrix, the second diagonal matrix having a nonzero coefficient that is one of the group consisting of 1 and −1.

30. The method of claim 26 wherein each of the matrix factors in the sequence is reduced to a plurality of matrix factors satisfying a locality condition.
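Claim 29's factoring of a diagonal matrix into elementary matrices has a well-known constant-matrix analogue, shown here for a 2×2 scale with determinant 1 (the monomial/polynomial case recited in the claim is analogous but not reproduced here): diag(c, 1/c) splits into three unit shears and a signed permutation.

    import numpy as np

    c = 2.5                                        # any nonunit scale
    S  = np.array([[c, 0.0], [0.0, 1.0 / c]])      # diagonal factor, det = 1
    E1 = np.array([[1.0, c], [0.0, 1.0]])          # unit upper shear
    E2 = np.array([[1.0, 0.0], [-1.0 / c, 1.0]])   # unit lower shear
    E3 = np.array([[1.0, c], [0.0, 1.0]])          # unit upper shear
    J  = np.array([[0.0, -1.0], [1.0, 0.0]])       # signed permutation

    assert np.allclose(E1 @ E2 @ E3 @ J, S)        # diag(c, 1/c) from shears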
31. The method of claim 26 wherein a plurality of the matrix factors in the sequence are combined to form at least one compound factor.
32. The method of claim 31 wherein the at least one compound factor is a unit triangular matrix.
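The compound factor of claims 31 and 32 relies on the fact that a product of unit triangular matrices of the same orientation is again unit triangular, so several elementary steps can be applied as one. A small NumPy check with arbitrary illustrative coefficients:

    import numpy as np

    E1 = np.array([[1.0, 0.0, 0.0],            # two elementary factors with
                   [0.5, 1.0, 0.0],            # the same (lower) orientation
                   [0.0, 0.0, 1.0]])
    E2 = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, -0.25, 1.0]])

    compound = E1 @ E2                                 # one compound factor
    assert np.allclose(np.diag(compound), 1.0)         # unit diagonal
    assert np.allclose(np.triu(compound, 1), 0.0)      # lower triangular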
33. The method of claim 26 wherein the transformation matrix has a determinant having a coefficient that is one of the group consisting of 1 and −1.
34. The method of claim 26 wherein the transformation matrix has a nonzero monomial determinant and the sequence of matrix factors further includes a scaling matrix.

35. The method of claim 26 further including: generating a matrix comprised of a plurality of input data values stored in the plurality of rows and columns, each row and column being a band, the plurality of input data values including a plurality of nonzero data values on one of the bands, the wavelet transform not altering the nonzero data values in the one band when applied to the generated matrix; and generating a modified transformation matrix from the sequence of matrix factors produced for the reduced transformation matrix, the modified transformation matrix storing a modified wavelet transform, the modified wavelet transform matching the wavelet transform stored in the transformation matrix when the plurality of nonzero data values in the one band are integer values.
36. The method of claim 26 further including: generating a matrix comprised of a plurality of input data values, the plurality of input data values including a one-dimensional range of integer input data values; mapping the one-dimensional range of integer input data values to a plurality of integer output data values, the integer input data values mapped to the plurality of integer output data values by the wavelet transform stored in the transformation matrix; and generating a modified transformation matrix from at least the sequence of matrix factors for the reduced transformation matrix, the modified transformation matrix storing a modified transform, the modified transform mapping the plurality of integer input data values to the plurality of integer output data values.
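The integer-to-integer behavior recited in claim 36 is the defining property of lifting-style wavelet implementations. As a familiar public example (not the claimed construction), the S-transform approximates the Haar average/difference pair, inverts exactly on integers, and agrees with the exact average whenever x0 + x1 is even:

    def forward(x0, x1):
        # Integer Haar (S-transform): exact difference, floored average.
        d = x0 - x1
        s = x1 + (d >> 1)          # == floor((x0 + x1) / 2) for integers
        return s, d

    def inverse(s, d):
        x1 = s - (d >> 1)
        return x1 + d, x1

    for x0 in range(-4, 5):
        for x1 in range(-4, 5):
            assert inverse(*forward(x0, x1)) == (x0, x1)   # lossless roundtrip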
37. The method of claim 26 wherein the sequence of matrix factors is generated from the application of the at least one plurality of row reduction operations to adjacent data values stored in the scaled transformation matrix.
38. The method of claim 26 wherein each of the matrix factors is generated from a time-consecutive application of the at least one plurality of row reduction operations to the scaled transformation matrix, each time-consecutive application applied to data stored in the matrix during a current time-consecutive application and all data stored during at least one prior time-consecutive application, the time-consecutive application resulting in a causal generation of the matrix factors.